USING BAYESIAN TECHNIQUES WITH ITEM RESPONSE THEORY TO ANALYZE MATHEMATICS TESTS by MARY MAXWELL


USING BAYESIAN TECHNIQUES WITH ITEM RESPONSE THEORY TO ANALYZE MATHEMATICS TESTS

by

MARY MAXWELL

JIM GLEASON, COMMITTEE CHAIR
STAVROS BELBAS
ROBERT MOORE
SARA TOMEK
ZHIJIAN WU

A DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Mathematics in the Graduate School of The University of Alabama

TUSCALOOSA, ALABAMA

2013

Copyright Mary Maxwell 2013
ALL RIGHTS RESERVED

ABSTRACT

Due to the cost of a college education, final exams for college-level courses fall under the category of high-stakes tests. An assessment that measures student ability inaccurately may result in students paying thousands of dollars toward retaking the course, scholarships being rescinded, or universities placing students in courses for which they are not prepared, as well as many other undesirable consequences. Therefore, faculty at colleges and universities must understand how reliably these tests measure student knowledge. Traditionally, for large general education courses, faculty use common exams and measure their reliability using Classical Test Theory (CTT). However, the cutoff scores are arbitrarily chosen, and little is known about the accuracy of measurement at these critical scores. A solution to this dilemma is to use Item Response Theory (IRT) models to determine the instrument's reliability at various points along the student ability spectrum. Since cost is always on the mind of faculty and administrators at these schools, we compare the use of free software (Item Response Theory Command Language) to generally accepted commercial software (Xcalibre) in the analysis of College Algebra final exams. With both programs, a Bayesian approach was used: Bayes modal estimates were obtained for item parameters, and EAP (expected a posteriori) estimates were obtained for ability parameters. Model-data fit analysis was conducted using two well-known chi-square fit statistics, with no significant difference found in model-data fit. Parameter estimates were compared directly, along with a comparison of Item Response Functions using a weighted version of the root mean square error (RMSE) that factors in the ability distribution of examinees,

resulting in comparable item response functions between the two programs. Furthermore, ability estimates from both programs were found to be nearly identical. Thus, when the assumptions of IRT are met for the two- and three-parameter logistic and the generalized partial credit models, the freely available software program is an appropriate choice for the analysis of College Algebra final exams.

DEDICATION

I dedicate this dissertation to my father, Dr. Michael F. Patton, for many reasons, but most importantly because he showed me that the pursuit of knowledge is a worthwhile, life-long endeavor.

LIST OF ABBREVIATIONS AND SYMBOLS

1PL   One-parameter logistic, as in the one-parameter logistic model
2PL   Two-parameter logistic, as in the two-parameter logistic model
3PL   Three-parameter logistic, as in the three-parameter logistic model
CFA   Confirmatory factor analysis
CIs   Confidence intervals
CSEM  Conditional model-predicted standard error of measurement
CTT   Classical Test Theory
D     D = 1.7, a scaling constant sometimes used in logistic models
EAP   Expected a posteriori
EFA   Exploratory factor analysis
EM    Expectation-Maximization, as in the EM algorithm
GPC   Generalized partial credit, as in the generalized partial credit model
ICC   Item characteristic curve
ICL   Item Response Theory Command Language
IRF   Item response function
IRM   Item response model
IRT   Item Response Theory
JMLE  Joint Maximum Likelihood Estimation
MAP   Maximum a posteriori
MLE   Maximum Likelihood Estimation

MMLE    Marginal Maximum Likelihood Estimation
OCC     Operating characteristic curve
R       A language and environment for statistical computing from the R Foundation for Statistical Computing
PC      Partial credit, as in the partial credit model
RMSE    Root mean squared error
SEM     Standard error of measurement
TCC     Test characteristic curve
W-RMSE  Weighted root mean squared error
θ       A latent trait or ability

ACKNOWLEDGMENTS

I would like to thank Dr. Jim Gleason for providing me with an opportunity to write a dissertation related to a topic about which I am passionate. He has been a patient and thoughtful mentor throughout the process. I would also like to acknowledge the other members of my dissertation committee: Dr. Stavros Belbas, Dr. Robert Moore, Dr. Sara Tomek, and Dr. Zhijian Wu. I am grateful to each of you for agreeing to be a part of this project. Many members of the UA Mathematics Department contributed to the success of my graduate career. I thank Dr. Zhijian Wu and Dr. Vo T. Liem, in particular, for accepting me into the program and encouraging me throughout the process. I enjoyed and benefited from every course in which I enrolled, and I am grateful to the dedicated mathematics faculty. I also thank the staff of the MTLC and the UA Mathematics Department. I had the opportunity to work with many instructors, and all were enthusiastic and happy to share their knowledge and experience with me. In addition, the office staff were more than willing to answer my many questions, or find someone who could do so. I owe a great deal to my undergraduate advisor at the University of Montevallo, Dr. Michael Sterner, whose many courses provided me with a firm foundation for graduate work and a glimpse into the secret garden of mathematics. Finally, I want to acknowledge my family members and friends whose constant reassurance helped me believe that I could achieve this goal.

CONTENTS

ABSTRACT
DEDICATION
LIST OF ABBREVIATIONS AND SYMBOLS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
   Purpose of Study and Research Questions
   Methodology
   Significance of the Study
   Limitations
   Organization of the Document
2. OVERVIEW OF ITEM RESPONSE THEORY
   Comparison of Classical Test Theory and Item Response Theory
   Item Response Models
   Parameter Invariance
   Parameter Estimation
   The Bayesian Approach
      MAP Estimates of Item Parameters
      EAP Estimation of Ability Parameters
   Information Functions
   The Ability Scale
   Model-Data Fit
      Unidimensionality and Local Independence
      Chi-Square Goodness of Fit Indices
      Other Considerations
   Overview of Parameter Estimation Software: ICL and Xcalibre
3. METHODS
   The Instrument: College Algebra Final Exam
   Unidimensionality and Local Independence
   Item and Ability Parameter Estimation
   Comparison of Parameter Estimates
   Model-Data Fit
4. RESULTS
   Unidimensionality and Local Independence
   Model-Data Fit
   Item Parameter Estimates
      Comparison of Item Parameter Estimates
      Comparison of Item Response Functions
   Ability Parameter Estimates
5. DISCUSSION
   Unidimensionality and Local Independence
   Model-Data Fit
   Item Parameter Estimates
   Ability Parameter Estimates
   Implications for Classroom Test Development
   Conclusions and Future Research
REFERENCES
APPENDICES
   A. ICL and Xcalibre Program Files
   B. Results from Confirmatory Factor Analysis
   C. Model-Data Fit: χ² Fit Statistics
   D. Item Parameter Estimates
   E. Item Response Functions

LIST OF TABLES

2.1 Domain Score π
Two-Index Presentation Strategy Hu and Bentler (1999)
Confirmatory Factor Analysis: All Items
Confirmatory Factor Analysis: Item 6 Removed
Items of Interest
Item Parameter Estimates: Standard Errors
Item Parameter Estimates: RMSE
Items with RMSE
Ability Estimates: Frequency Distribution
Ability Estimates: Summary Statistics
A.1 Control File
B.1 Factor Loadings Item 6 Removed: Items
B.2 Factor Loadings Item 6 Removed: Items
B.3 Unique Variances Item 6 Removed: Items
B.4 Unique Variances Item 6 Removed: Items
B.5 Standardized Coefficients Item 6 Removed: Items
B.6 Standardized Coefficients Item 6 Removed: Items
B.7 Modification Indices: A Matrix
B.8 Modification Indices: P Matrix
C.1 Xcalibre Fit Statistics
C.2 Item Misfit
C.3 Q₁ Statistic: Items
C.4 Q₁ Statistic: Items
C.5 Q₁ Statistic: Items
C.6 S-χ² Statistic
D.1 Summary Statistics: All Items
D.2 3PL Parameter Estimates and Standard Errors
D.3 2PL Parameter Estimates and Standard Errors
D.4 GPC a Parameter Estimates and Standard Errors
D.5 GPC bᵢ Parameter Estimates and Standard Errors
D.6 Summary Statistics by Item Type
E.1 RMSE and W-RMSE for ICL and Xcalibre IRFs

LIST OF FIGURES

2.1 ICC: a = 0.76, b = 0.21, and c =
OCC for a 3-Category Item: b_j1 = 1 and b_j2 =
OCC for a 3-Category Item: b_j1 = 1, b_j2 =
GPCM Example
Similar ICCs with Different Parameter Estimates
The Effect of the a Parameter on Item Information
PL and 3PL Items with Similar a Parameters
Information for a GPC and a 2PL Item
Test Information
Test Characteristic Curve
Scree Plot
Misfitting Item: Q₁ Statistic
RMSE and W-RMSE
Scree Plot: All Items
Scree Plot: Item 6 Removed
Item
Item
Item
Item
Item
4.8 Item
Item
Item
Item 16: Response Category Two
Item 40: Response Category Zero
PL Items: Parameters Outside Xcalibre CIs (1 of 2)
PL Items: Parameters Outside Xcalibre CIs (2 of 2)
Item
Item 2 and Item
Frequency Histograms of Parameter Estimates: 3PL Items
PL Items with RMSE
GPC Items with RMSE
Test Information and SEM
Ability Distributions: Joint Histogram
Test Response Functions: RMSE =
Prior Distributions for the c Parameters
E.1 Items 1 to
E.2 Items 5 to
E.3 Items 11 to
E.4 Items 17 to
E.5 Items 23 to
E.6 Items 29 to
E.7 Items 35 to
E.8 Items 41 to
E.9 Items 47 to

CHAPTER 1
INTRODUCTION

While students enrolled in courses at the university level have diverse expectations, every student expects that the grades earned will accurately reflect his or her ability. An emphasis on grades may seem misplaced to faculty who prefer to focus on the knowledge that is attained and the progress that is made by the student. However, this grade is the single statistic used to measure the student's ability, and with it come consequences critical to the student's and the University's success. For the student, these consequences vary according to the course and depend somewhat on the student's personal situation, but one serious outcome for any student is inherent in general education courses, where not passing, or not achieving some minimum grade, results in a student's having to retake a course. Not only are thousands of dollars in tuition at stake, but graduation may be delayed by a semester if not an entire academic year. Tuition rates are at an all-time high, and an increasing number of students rely on financial aid in the form of loans and/or scholarships that are almost always affected by a student's grades. In addition, general education courses are often prerequisites. Having an unprepared student move on to a subsequent course results in a frustrating situation for both the student and the professor, in which the student's likelihood of failure is greatly increased.

Grades received in general education courses also have serious consequences for the University. General education courses are under scrutiny and are often assessed by the

administration for accountability reasons. For example, a student passing a course, but failing an assessment, points to a lack of accountability and to possible grade inflation. Low pass rates, especially in introductory courses, translate into decreased retention rates. Therefore, any assessment that significantly impacts a student's grade should be treated as a high-stakes test. Final exams typically fall into this category. Ideally, all assessments should accurately measure a student's ability and yield meaningful results; i.e., assessments should be reliable. When a test falls into the category of high-stakes, reliability is crucial.

In general education courses at large universities, there is a growing trend of using common exams composed of questions chosen from a test bank provided by the textbook publisher and graded based on traditional pre-set cutoff scores. The final exams in these courses may account for thirty to fifty percent of a student's grade. These are high-stakes tests that raise two critical issues: How are questions chosen, and what is known about the accuracy of the cutoff scores? The decision to include a particular question is, in part, a subjective one based on the instructor's opinion and/or previous experience. Some analysis always takes place. Bad questions are often identified, removed, and replaced with better ones. But what characterizes a bad question or a good one? A flawed question could be one that almost every student answers correctly. Such a question is often included at the beginning of a test so that a student relaxes, yet it reveals little information about a student's ability level. The number of such questions should be limited. Another type of flawed question is one that is answered incorrectly by students with high ability and correctly by those with low ability. Such a question may have faulty distractors (incorrect choices) or incorrect wording. The

first type of flawed question, the too-easy one, is readily identified by inspection of exam results. The second type is not. Examples of good questions are ones that discriminate well at some key ability level, for example, at the pass/fail or the A/B cutoff score. Again, one of these types of questions, the latter, is easily identified while the other is not. Instructors often have neither the time, nor the background, to conduct a thorough analysis of exam results. When analysis does take place, it is generally rooted in Classical Test Theory (CTT), the traditional approach to test design and analysis, where it is assumed that the score a student achieves on a test, X, is an estimate of that student's true ability, T, plus measurement error, E. A question's difficulty is based on the proportion of students who answer it correctly, and its ability to discriminate is determined by how well it correlates with the test score. Both of these characteristics are dependent upon the population being tested and/or the test itself. Little is known about the accuracy of the traditional cutoff scores used for assigning grades that are based solely on the number of correctly answered questions.

Item Response Theory (IRT), which has been used for many years in the development and analysis of tests such as the ACT or GRE, provides an alternative to Classical Test Theory. In both IRT and CTT, an item refers to a question on a test. When used correctly, IRT analysis quickly identifies good and bad items, such as those discussed above. The analysis provides test designers with specific information about each item's difficulty and its level of discrimination that is neither test nor population dependent. With IRT analysis, test development becomes a more meaningful and systematic process, and choosing cutoff scores becomes less arbitrary.

Instructors may wish to reduce the number of items required for an assessment. Consider a college algebra final exam consisting of fifty multiple-choice and free-response items. An exam of this length is a daunting prospect for the student, and its administration is a time-consuming exercise. As enrollment and class sizes continue to grow, the administration of these exams becomes increasingly difficult. The results of IRT analysis distinguish between items that provide information about students' ability levels and those that contribute little or nothing to the assessment. The items that are of little value are candidates for removal, and a reduction in test length translates into time and money savings. Instead of basing a student's grade on a number-correct score, IRT produces an estimate of a student's ability that factors in the difficulty of the correctly answered questions, resulting in less arbitrary cutoff scores.

Commercial and free IRT software packages that run on desktop computers are available, and the Institute for Objective Measurement maintains a web page containing a comprehensive listing that is frequently updated [30]. Several of these software programs are capable of IRT analysis of college-level algebra exams in use at large universities. For this research, the focus was on a Bayesian approach to IRT analysis. Xcalibre is a commercial software package with Bayesian capabilities that can analyze mixed-format tests: those with multiple-choice, free-response, and partial-credit questions [19]. Item Response Theory Command Language, or ICL, is a freely available program with similar capabilities [26]. The goal of this study was to determine whether ICL provides results comparable to Xcalibre, thus making it a good choice for educators and researchers who would like to

conduct IRT analyses, but who may lack funding for commercial software. A college algebra final exam at a large university was analyzed by both programs, and the results from both programs were compared to determine their differences and similarities.

1.1. Purpose of Study and Research Questions

The purpose of this study was to determine whether a freely available software program, Item Response Theory Command Language (ICL), produces results comparable to a well-known commercial software program, Xcalibre, using a college algebra final exam [27] [19]. An item is a question on a test, and in both IRT and CTT, an item is characterized by one or more parameters, such as those that indicate its level of difficulty or its degree of discrimination. In addition, in IRT, an item is completely characterized by the graph of its Item Response Function (IRF). The research questions underlying this study were:
1. Does the college algebra final exam meet the assumption of unidimensionality, which is necessary for IRT analysis?
2. Are the results from ICL equivalent to those of Xcalibre?
   a. Do the programs differ in terms of model-data fit?
   b. Do the programs produce item parameter estimates that are equivalent?
   c. Do the programs produce equivalent results in terms of the graphs of the Item Response Functions?
   d. Do the programs produce equivalent estimates of examinee ability levels?
3. What are the differences between Xcalibre and ICL?

1.2. Methodology

The first step in the study was establishing one of the primary assumptions of Item Response Theory: unidimensionality. An exam is unidimensional provided there is one primary factor influencing an examinee's probability of a correct response. (Note that unidimensionality is generally assumed for any course in which a student's ability level is represented by a single grade.) Next, the item and ability parameter estimates from Xcalibre and ICL were obtained, and the results were compared using the following methods:
1. Model-data fit analysis was conducted using two well-known chi-square goodness-of-fit statistics: Q₁ and S-χ² [23] [40]. Differences in fit were further investigated graphically, quantitatively, and qualitatively.
2. Default settings for parameter estimation were used for ICL, while some settings were adjusted slightly in Xcalibre to better match the ICL defaults.
3. Item parameter estimates were compared directly by creating confidence intervals for the Xcalibre results based on the standard errors of its parameter estimates.
4. Item parameter distributions were compared using summary statistics and graphical techniques.
5. The Item Response Functions for all items were compared graphically. The root mean squared error (RMSE) and a weighted version of the root mean squared error (W-RMSE) that factors in the ability distribution of examinees were used to compare the IRFs quantitatively.
6. The Test Response Functions, Test Information Functions, and the standard error of measurement (SEM) were all compared graphically as well as quantitatively

using the RMSE. (Note that the SEM is also known as the conditional model-predicted standard error of measurement (CSEM) [19].)
7. Ability estimates were compared quantitatively by calculating the actual difference in these values for each examinee.
8. The resulting ability distributions of the population were compared graphically and directly via frequency distributions and summary statistics.

1.3. Significance of the Study

The techniques of Item Response Theory can be used by faculty and researchers to improve the design and evaluation of high-stakes tests such as college algebra final exams. This study was significant as it demonstrated that a freely available software package is an appropriate choice for IRT analysis, making this analysis available to faculty and researchers who may lack funding for commercial software.

1.4. Limitations

Some of the limitations of this study were:
1. Only one college algebra exam was analyzed, using the results of one population of examinees.
2. The true item and ability parameters were unknown.
3. The exam was determined to be unidimensional, but only to a certain degree, which was accepted as sufficient for proceeding with IRT analysis. It is quite likely that the exam contained several redundant items that affected the unidimensionality.

Another confounding factor was the obvious ordering of the items (items were presented on the final exam in the order they were presented in the classroom).
4. When answered incorrectly, the option chosen (or distractor) was unknown for multiple-choice items. While not necessarily a limitation, more analysis would have been possible with these data.
5. The default settings were used for ICL. Other options are available that would have resulted in different parameter estimates.

1.5. Organization of the Document

Chapter 1 provides motivation for the study and outlines its purpose, research questions, methodology, significance, and limitations. Chapter 2 begins with an overview of Classical Test Theory and contrasts this traditional approach with Item Response Theory. Classical Test Theory is a relatively simple approach to test analysis that relies on few assumptions. Item Response Theory is a more complex approach requiring several strong assumptions. Chapter 2 provides details of the assumptions, the mathematical theory, and some models that underlie IRT. The Bayesian approach to IRT has become more popular recently due to the increased computing power of personal computers. Bayesian methods are discussed in this chapter along with one of the parameter estimation algorithms currently in use, the EM algorithm. The research methods are found in Chapter 3, followed by the results of the study in Chapter 4, where the emphasis was on determining the significant differences in the IRT analyses provided by the two software programs. Chapter 5 contains a discussion of the results along with conclusions and future research possibilities. Additional information

can be found in the Appendices. Some of the software program files are in Appendix A, and the remaining appendices contain supplementary tables from the confirmatory factor analysis, complete tables of model-data fit statistics, parameter estimates, the graphs of Item Response Functions for all items, and complete tables of the RMSEs and W-RMSEs for each item's pair of IRFs.
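Because the RMSE and W-RMSE comparisons of paired IRFs recur throughout the study, a minimal sketch of how such a comparison could be computed is given below. It assumes two item response functions evaluated on a common grid of θ values, with weights taken from an assumed standard normal ability density; the parameter values and the specific weighting choice are illustrative assumptions, not the dissertation's actual implementation.

```r
# Sketch: compare two item response functions (e.g., ICL vs. Xcalibre estimates
# of the same item) on a grid of theta values using RMSE and a weighted RMSE.
theta <- seq(-4, 4, by = 0.1)            # ability grid
w     <- dnorm(theta)                    # assumed N(0, 1) examinee ability density

p3pl <- function(theta, a, b, c, D = 1.7) {
  c + (1 - c) / (1 + exp(-D * a * (theta - b)))
}

# Hypothetical parameter estimates for the same item from two programs
p_icl      <- p3pl(theta, a = 0.80, b = 0.20, c = 0.22)
p_xcalibre <- p3pl(theta, a = 0.76, b = 0.25, c = 0.20)

rmse   <- sqrt(mean((p_icl - p_xcalibre)^2))
w_rmse <- sqrt(sum(w * (p_icl - p_xcalibre)^2) / sum(w))

round(c(RMSE = rmse, W_RMSE = w_rmse), 4)
```

The weighted version down-weights differences in regions of the ability scale where few examinees are found, which is the stated motivation for the W-RMSE.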

CHAPTER 2
OVERVIEW OF ITEM RESPONSE THEORY

2.1. Comparison of Classical Test Theory and Item Response Theory

Using an assessment to measure an unobservable ability is a familiar concept and the subject of much research and debate. Surveys and opinion polls are designed to measure vague concepts like happiness or political inclination, often using a Likert scale where the possible responses are categories: Strongly Disagree, Disagree, No Opinion, Agree, Strongly Agree. Medical professionals use clinical assessments to diagnose patients, political parties try to measure public opinion with polls, and in the academic setting, surveys are used for students' evaluation of courses and to obtain faculty/staff feedback on university policies. Tests are required by the government to obtain a driver's license or citizenship. Most common, perhaps, is the test aimed at assessing a student's ability encountered in all areas of education. Such tests range from simple quizzes to high-stakes tests that determine, for example, whether a child moves on from the fourth to the fifth grade. In higher education, semester and final exams that are the primary method of assessment in a course are certainly high-stakes tests. These exams are frequently encountered in, but not limited to, general education courses. In fact, they are often encountered in graduate-level courses, as in a Master of Education program. The importance of being able to accurately measure the ability in question is undeniable, as results have grave consequences: the awarding of scholarships, graduating from high school, assigning a grade, obtaining a driver's license, or earning a

degree from a university. Regardless of the type of assessment used, it should be both reliable and valid. A reliable test yields accurate measurements and produces consistent results over time and across different populations. A valid test measures what it was designed to measure. Note that a test may be reliable but invalid [46].

Classical Test Theory (CTT) is an approach toward test data analysis that has been researched and used for over sixty years. However, as far back as 1953, F. M. Lord made a critical observation: He noted that in CTT, an examinee's true ability score is not necessarily reflected by the observed test score, or even by the so-called true test score (where adjustments for measurement error have been taken into account) [22]. One reason this occurs is that a student will score higher on a less difficult test. Thus, this ability score is dependent on the characteristics of the test, whereas, in reality, the ability is not test-dependent, but is fixed, at least temporarily [22]. This observation highlights one of the weaknesses of Classical Test Theory that Item Response Theory, a relatively new approach, can address. In both Classical Test Theory (CTT) and Item Response Theory (IRT), a test is called an instrument and is any type of assessment used in the academic or non-academic setting. An instrument is composed of items, i.e., questions. This terminology will be adopted throughout this paper (the term test, however, will also be used). One characteristic, and advantage, of CTT is that it is based on weak assumptions from which many useful statistical measures follow. The first assumption is that the score a student earns on a test, X, is an estimate of the student's true score, T, plus some measurement error, E, so that X = T + E. There are two unknowns in this equation, the true score, T, and the measurement error, E, that cannot be measured directly. Another important assumption is that the measurement

error, E, is normally distributed with an average value of zero. A consequence of this assumption is that the covariance of T and E is zero, which means σ²_X = σ²_T + σ²_E [21]. In Classical Test Theory, the true test score, T, is the expected value of observed performance on the test of interest [23, p. 2] and is measured by the score achieved on a test. But this score depends on more than the examinee's ability. Faculty who write tests often realize afterwards that the test they have administered was too easy; adjustments are usually made for subsequent semesters. Other times, the realization is that the test was too difficult, and the faculty member may make immediate adjustments to the grades based on his or her interpretation of the scores. In other words, the score an examinee earns is dependent upon the difficulty of the test. If no adjustments are made, an examinee's ability level may not be measured accurately.

Consider two sections of the same mathematics course at a university taught by two different professors, X and Y. Professor X is known for writing more difficult tests than Professor Y. Students with high ability are likely to earn the same grade from either professor, as are students of very low ability. However, a grade of C from Professor X most likely indicates a higher ability level than a similar grade from Professor Y. This example highlights the fact that there is often more error of measurement associated with examinees of average ability. Mathematically, this can be illustrated by considering one definition of the standard error of measurement in CTT. Consider a test with n dichotomously scored items, and let X denote the number correct. Then the standard error of measurement is defined as

σ_e = \sqrt{ \frac{X(n - X)}{n - 1} },    (2.1.1)

and σ_e is at a maximum for examinees who answer half the items correctly and lowest for those at either extreme [9]. Thus, error is often highest in the middle range of abilities where most examinees are found, which is a drawback of CTT that IRT addresses.

In CTT, the focus is on the test as a whole and on the true score, T, but the items are also analyzed in terms of their difficulty level and level of discrimination. The difficulty of an item, denoted p, is calculated using the percentage of examinees in the given sample who answer it correctly, making the difficulty of an item a function of the sample population. An item's level of discrimination, denoted r, can be estimated using the point-biserial correlation between the item score and the total test score. (This correlation is comparable to the Pearson product-moment correlation and is appropriate when one variable is dichotomous, as in a test with correct/incorrect responses, and the other is continuous, e.g., ability level.) For a test with dichotomous items, the point-biserial correlation for one item is

r_{pb} = \frac{\bar{X}_1 - \bar{X}_0}{s_X} \sqrt{p(1 - p)},    (2.1.2)

where \bar{X}_1 is the average score of those examinees who answered the item correctly and \bar{X}_0 is the average score of those who answered the item incorrectly. The term in the denominator, s_X, is the standard deviation of the scores of all the examinees, and p is the proportion of those who answered the item correctly. Note that the value of r is dependent on the test and the population [22].
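As a concrete illustration of these classical indices, the sketch below computes the difficulty p, the point-biserial discrimination of equation (2.1.2), and the standard error of measurement of equation (2.1.1) from a small matrix of scored responses. The data set and function names are hypothetical and are not taken from the dissertation's exam.

```r
# Sketch: classical item statistics for dichotomously scored responses.
# Rows are examinees, columns are items; the tiny data set is hypothetical.
resp <- matrix(c(1, 1, 0, 1, 0,
                 1, 0, 0, 1, 0,
                 1, 1, 1, 1, 1,
                 0, 0, 0, 1, 0,
                 1, 1, 1, 0, 1), nrow = 5, byrow = TRUE)

total   <- rowSums(resp)                # number-correct score X for each examinee
n_items <- ncol(resp)

# Item difficulty: proportion of examinees answering each item correctly
p <- colMeans(resp)

# Point-biserial discrimination for one item, as in equation (2.1.2)
point_biserial <- function(item, total) {
  x1 <- mean(total[item == 1])          # mean score of those answering correctly
  x0 <- mean(total[item == 0])          # mean score of those answering incorrectly
  pj <- mean(item)
  (x1 - x0) / sd(total) * sqrt(pj * (1 - pj))
}
r_pb <- apply(resp, 2, point_biserial, total = total)

# CTT standard error of measurement for each examinee, as in equation (2.1.1)
sem <- sqrt(total * (n_items - total) / (n_items - 1))

list(difficulty = round(p, 2), discrimination = round(r_pb, 2), sem = round(sem, 2))
```

Both p and r_pb depend on the particular sample of examinees, which is exactly the sample dependence discussed above.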

Classical Test Theory is still widely used and can be the correct choice. It has the advantage of being based on weak assumptions, such as that the error, E, is normally distributed with mean zero and is uncorrelated with the test score, X, and with the error from any other test. However, it is clear that in CTT, the ability score is test-dependent, and the item characteristics are both sample- and test-dependent. Further, the CTT model does not allow test designers to predict how an examinee at a particular ability level will respond to a given item. Information based at the item level is useful. Some tests need to discriminate at a particular ability level, as in the awarding of scholarships or in professional certification exams where it only matters if an examinee scores above or below a certain level [23]. Another drawback of CTT's test-level focus is that the removal or addition of items completely changes the test, requiring its reevaluation. Item Response Theory addresses each of these issues.

Another key observation made by Lord in 1980 is that because the examinee's true ability, T, and the error score, E, are both unknown, the equation X = T + E cannot be disproved by the data collected, i.e., the question of model-data fit is not addressed [8]. In contrast, in Item Response Theory, the question of model fit is critical and is examined closely. Item Response Theory can be described as "a general statistical theory about examinee item and test performance and how performance relates to the abilities that are measured by the items in the test" [22, p. 255]. This theory is embodied in many mathematical models that have been created and are still being developed for different types of testing situations. In IRT, the ability being measured is called the latent variable, and IRT can be viewed as "a system of models that defines one way of establishing the correspondence between latent variables and their manifestations" [8, p. 4], which are the observed responses to items on a test.

Item Response Models (IRMs) focus on the item and the probability of a correct response to the item. The mathematical models incorporate a varying number of parameters, but one parameter all models have in common is the latent trait being measured, which is typically called the ability parameter and denoted by θ. A unidimensional model is appropriate when the researcher believes that only one underlying ability influences an examinee's response to an item, and when the unidimensionality can be established to a reasonable degree. This assumption underlies many testing situations where one grade is assigned as a single measure of ability. Unlike Classical Test Theory, IRT has strong assumptions [23]. For a unidimensional model, one with a single dominant factor influencing an examinee's responses to an item, these assumptions are:
- Only one unidimensional latent variable, θ, affects the probability of a correct response to an item.
- When the assumption of unidimensionality holds, the responses to a pair of items are statistically independent (this assumption is known as local independence).
- The probability of a correct response to an item is characterized by the Item Response Function, or IRF.
- A positive change in θ leads to an increase in the probability of a correct response to an item.
- Estimates of item parameters are invariant across differing tests and differing sample populations. Estimates of ability parameters differ only by measurement error.

In the case of a typical final exam in a college setting, there are other factors involved: motivation, test anxiety, general cognitive skills, the ability to work quickly, etc., but these factors are assumed to play a relatively minor role. To say that the responses to a pair of items are statistically independent means that, after taking into account the ability level of the examinee, the responses to the different items are in no way related. In other words, the responses are conditionally independent given the ability level.

Some assessments are not one-dimensional. A clinician may use an instrument that measures a person's ability to adopt and adhere to a healthy diet. This ability is affected by the person's actual knowledge (a cognitive ability) of healthy food choices and the person's motivation to adopt such a diet (an affective dimension). In this case, the two dimensions may not interact. In contrast, a complicated word problem in mathematics may be two-dimensional: one factor is reading comprehension and the other is mathematical ability. In this case, the two factors do interact. Multidimensional models are treated as either noncompensatory or compensatory, based on the interaction of the abilities being measured [8]. Multidimensional IRT (MIRT) models are becoming more widely used; see [54], for example. The assumption of local independence can hold even when unidimensionality does not, provided the dominant factors are specified and can be conditioned on [23].

Local independence can be described mathematically as follows [15]. Suppose we have an instrument with J items. Assume unidimensionality and let θ_i represent the ability being measured for examinee i. Let y_i be the random vector representing the J responses of examinee i. Then y_i = (y_{i1}, y_{i2}, …, y_{iJ}) represents a particular response pattern. The

probability of this response pattern given the ability level θ_i is

p(y_i | θ_i) = p(y_{i1} | θ_i) p(y_{i2} | θ_i) \cdots p(y_{iJ} | θ_i) = \prod_{j=1}^{J} p(y_{ij} | θ_i).    (2.1.3)

Suppose a test consists of 3 items, and denote a correct response by a 1 and an incorrect response by a 0. Then y_i = (1, 1, 0) represents the response pattern of examinee i who answered items 1 and 2 correctly and item 3 incorrectly. Under the assumption of local independence, the probability that examinee i would have such a response pattern given an ability level of θ_i is

p(y_i = (1, 1, 0) | θ_i) = p(y_1 = 1 | θ_i) p(y_2 = 1 | θ_i) p(y_3 = 0 | θ_i).    (2.1.4)

The topic of unidimensionality and local independence is explored further later in this chapter and is closely related to model-data fit.

The assumptions of Item Response Theory are more difficult to meet than those of Classical Test Theory. In CTT, one assumes that X = T + E and that the mean of E is zero, while in IRT, researchers attempt to verify the assumptions of unidimensionality and local independence. The assumptions of IRT are strong, while those of CTT are weak. If these assumptions are verified, then a researcher must choose the model or models that he or she believes will best fit the data. In IRT, as in CTT, items are characterized by parameters. In the simplest dichotomous model, the Rasch model, there is one item parameter: difficulty. There are many models available with more being developed, and each model has one or more item parameters: difficulty, discrimination, and/or guessing parameters, for example.

Polytomous models, like the partial credit models, have a varying number of parameters, depending on the nature of the item. Details of IRT models used in this study are provided in Section 2.2. After specifying the models, the researcher must decide what parameter estimation technique is appropriate, obtain the estimates, and conduct model-data fit analyses [23]. The details of some of the most common item response models and their item response functions are introduced in the next section. The remainder of this chapter provides an overview of many aspects of Item Response Theory, including parameter estimation techniques employing Bayesian strategies, methods of assessing model-data fit, and details of the two software programs used in this study.

2.2. Item Response Models

The most widely used IRM is the one-parameter logistic (1PL) model, commonly known as the Rasch model. Technically speaking, the 1PL-model differs from the Rasch model. In both models, items are assumed to have the same level of discrimination, but in the Rasch model this value is restricted to 1. The difficulty parameter, b, is the one parameter being estimated (in addition to the ability parameter θ). Although it is still widely used, the one-parameter model is limited since it does not take into account the different levels of item discrimination. (Sometimes it is desirable to have all items of equal discrimination; for example, this is often the case in computer adaptive testing situations.) The two-parameter logistic, 2PL, model has a difficulty parameter b and an item discrimination parameter, a.

The two-parameter model was originally developed based on the cumulative normal distribution; however, logistic functions are easier to work with [2]. By incorporating a scaling constant D = 1.7, the two-parameter normal ogive and the two-parameter logistic models differ in absolute value by less than 0.01 for all values of θ [23, p. 15]. The three-parameter logistic model incorporates a third parameter, c, often called the guessing parameter. All three of these models are unidimensional models appropriate for dichotomously scored tests (items that are scored as either correct or incorrect). If a multiple-choice test with j = 1, 2, …, J items is administered to n = 1, 2, …, N examinees, the three-parameter logistic, or 3PL, model gives the probability of a correct response to item j by examinee n as

p(y_{nj} = 1 | θ_n, a_j, b_j, c_j) = c_j + (1 - c_j) \frac{\exp(D a_j (θ_n - b_j))}{1 + \exp(D a_j (θ_n - b_j))},    (2.2.1)

where a_j is the item discrimination parameter, b_j is the item difficulty parameter, c_j is the guessing parameter, and D = 1.7 is the scaling constant used to align this logistic model with the model based on the cumulative normal distribution (normal ogive) [23, p. 14]. The third parameter, c_j, can be viewed as an adjustment for guessing on multiple-choice tests. Fixing the appropriate parameters of the three-parameter model yields the two- and one-parameter models. Setting c_j = 0 results in the two-parameter model, and setting both c_j = 0 and a_j equal to a constant results in the one-parameter model.

In Item Response Theory, an item is characterized by its Item Response Function, IRF. When a dichotomous model is used, the graph of the IRF is often called the Item Characteristic Curve, or ICC. Again suppose an instrument has J items. Let p_j(θ) denote the

probability that a randomly chosen examinee answers item j correctly. For j = 1, 2, 3, …, J, the Item Characteristic Curve for the three-parameter model is given by

p_j(θ) = c_j + (1 - c_j) \frac{\exp(D a_j (θ - b_j))}{1 + \exp(D a_j (θ - b_j))}.    (2.2.2)

The graph in Figure 2.1 represents the ICC for an item with a = 0.76, b = 0.21, and c = 0.23.

Figure 2.1. ICC: a = 0.76, b = 0.21, and c = 0.23

In the 2PL-model, the difficulty parameter, b_j, represents the point on the ability scale where the probability of a correct response to item j is 0.50, and in the 3PL-model, the probability of a correct response at θ = b_j is (1 + c_j)/2. To see this, note that when θ = b_j in the 2PL-model,

p(θ = b_j) = \frac{\exp(D a_j (b_j - b_j))}{1 + \exp(D a_j (b_j - b_j))} = \frac{1}{2} = 0.5,    (2.2.3)

and in the 3PL-model,

p(θ = b_j) = c_j + (1 - c_j) \frac{\exp(D a_j (b_j - b_j))}{1 + \exp(D a_j (b_j - b_j))} = c_j + \frac{1 - c_j}{2} = \frac{1 + c_j}{2}.    (2.2.4)

If the ability scale representing θ is based on a normal distribution with a mean of zero and a standard deviation of one, then typically -3 ≤ θ ≤ 3 and -2 ≤ b_j ≤ 2, with smaller values of b representing less difficult items. An item that functions well would be one whose b parameter is located in a useful region of θ. In other words, test designers want items that have b parameters that match the ability distribution of the population and that are located at certain key cutoff points, and the number of very easy (low b values) and very difficult (high b values) items should be limited. For example, a difficulty level of b = -6, when there are no examinees of such low ability, contributes nothing to the ability estimates of the population.

In theory, the a_j parameter can take any real value. However, in most cases an item with a negative discrimination is flawed and would be discarded. (This could be an item with an incorrect answer key, for example, where examinees with high ability might answer it incorrectly, while those with low ability might guess the keyed answer.) Note that the value of a_j is proportional to the slope of the ICC at the inflection point θ = b_j. For, in the 2PL model,

p'(θ) = \frac{D a \exp(D a (θ - b_j))}{(1 + \exp(D a (θ - b_j)))^2},    (2.2.5)

and so

p'(b_j) = \frac{D a}{4}.    (2.2.6)

In the 3PL model,

p'(b_j) = (1 - c_j) \frac{D a}{4}.    (2.2.7)

An item with perfect discrimination would result in an undefined slope. When an item discriminates well, it has a higher value of a (typically, a should be close to one or higher), and in general a will not be negative. However, even when an item has an appropriate a value, it is only useful if it discriminates well at an appropriate location on the ability scale; i.e., the item must also possess an appropriate value for the b parameter.

The parameter c affects the lower asymptote of the ICC. In the one- and two-parameter models, this lower asymptote is p_j(θ) = 0, reflecting the near-zero probability of an examinee of very little ability answering the item correctly. In the three-parameter model, the lower asymptote is p_j(θ) = c_j. For illustration, consider a multiple-choice item with four possible answers where the distractors (incorrect choices) are equally likely to be chosen by an examinee of low ability. Here a value of c = 0.25 reflects that an examinee of low ability has a 25% chance of answering the item correctly. The c parameter value provides information regarding the quality of the distractors. If this same item had a c value much greater than 0.25, then some distractors are being eliminated quite easily by those who are simply guessing, and a parameter estimate of c < 0.25 would be indicative of distractors that are functioning extremely well.
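A small numerical sketch can make these relationships concrete. Assuming hypothetical parameter values (they are not estimates from the dissertation's exam), the code below evaluates the 3PL item response function of equation (2.2.2) and checks that the probability at θ = b_j equals (1 + c_j)/2 and that the slope there equals (1 - c_j)Da/4, as in equations (2.2.4) and (2.2.7).

```r
# Sketch: the 3PL item response function and its behavior at theta = b.
# Parameter values below are hypothetical, chosen only for illustration.
p3pl <- function(theta, a, b, cpar, D = 1.7) {
  cpar + (1 - cpar) * exp(D * a * (theta - b)) / (1 + exp(D * a * (theta - b)))
}

a <- 0.9; b <- 0.4; cpar <- 0.2; D <- 1.7

# Probability of a correct response at theta = b should equal (1 + c) / 2
p_at_b <- p3pl(b, a, b, cpar)
c(p_at_b = p_at_b, expected = (1 + cpar) / 2)

# Numerical slope at theta = b should be close to (1 - c) * D * a / 4
h <- 1e-6
slope_numeric <- (p3pl(b + h, a, b, cpar) - p3pl(b - h, a, b, cpar)) / (2 * h)
c(slope_numeric = slope_numeric, expected = (1 - cpar) * D * a / 4)
```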

Of course, some items cannot be scored dichotomously, such as tests with multi-step questions where partial credit is assigned for partially correct answers. These items require polytomous models. Although they require more item parameter estimations, polytomous models can provide more accurate information regarding the latent variable [15]. Two widely used models are the Partial Credit Model and the Graded Response Model. The Partial Credit Model (PC Model) does not require the difficulty at each step of an item to be ordered, whereas the Graded Response Model requires an ordering (increasing or decreasing) of the step difficulty parameters. The Graded Response Model is more appropriate for items where the response options are on a scale, for example, a Likert scale.

To illustrate the PC model, consider the following simple example from [8]. This model was developed by G. N. Masters (1982) and is based on the 1PL-model. The level of discrimination is assumed to be constant, i.e., a = 1. Of course, the other basic assumptions of IRMs must hold (unidimensionality, local independence). Suppose the item states: (2 + 3)/10 = ? It is possible to score this item as if it consists of two separate steps. First an examinee must correctly perform the addition, followed by correctly performing the division. A correct response at each step is denoted by 1 and an incorrect response by 0. Let y_j = 0 be the outcome that the examinee answers no part correctly, y_j = 1 be the outcome where the examinee only performs the addition correctly, and let y_j = 2 be the outcome where the examinee answers the entire question correctly. These three possible outcomes, y_j = {0, 1, 2}, are called category scores. It is assumed that there exists a point (often called a location) on the ability scale, b_j1, that represents the transition between scoring y_j = 0 and y_j = 1. At θ = b_j1, an examinee has an equal chance of scoring a 0 or a 1. Below b_j1, the probability of

scoring y_j = 0 is greater than 0.5, and, likewise, above b_j1, the probability of scoring y_j = 1 is greater than 0.5. Similarly, there is a point b_j2 representing the transition between the likelihood of scoring y_j = 2 and y_j = 1. These transition points break up the item so that a dichotomous model can be used for each pair (0, 1) and (1, 2). The Partial Credit Model due to Masters [2] is given by

p(y_j | θ, b_{jh}) = \frac{\exp\left( \sum_{h=0}^{y_j} (θ - b_{jh}) \right)}{\exp(0) + \sum_{k=1}^{m_j} \exp\left( \sum_{h=0}^{k} (θ - b_{jh}) \right)} = \frac{\exp\left( \sum_{h=0}^{y_j} (θ - b_{jh}) \right)}{\sum_{k=0}^{m_j} \exp\left( \sum_{h=0}^{k} (θ - b_{jh}) \right)},    (2.2.8)

where \sum_{h=0}^{0} (θ - b_{jh}) ≡ 0 and where m_j represents the number of categories for item j. The probabilities of each of the outcomes y_j = {0, 1, 2} are given by

p(y_j = 0 | θ, b_{jh}) = \frac{\exp(0)}{\exp(0) + \exp(θ - b_{j1}) + \exp(θ - b_{j1})\exp(θ - b_{j2})},    (2.2.9)

p(y_j = 1 | θ, b_{jh}) = \frac{\exp(θ - b_{j1})}{\exp(0) + \exp(θ - b_{j1}) + \exp(θ - b_{j1})\exp(θ - b_{j2})},    (2.2.10)

and

p(y_j = 2 | θ, b_{jh}) = \frac{\exp(θ - b_{j1})\exp(θ - b_{j2})}{\exp(0) + \exp(θ - b_{j1}) + \exp(θ - b_{j1})\exp(θ - b_{j2})}.    (2.2.11)

The IRFs for a polytomous item are called operating characteristic curves (OCCs), and these graphs reflect the probability of scoring in each category. Let b_j1 = -1.0 and b_j2 = 1.0. Figure 2.2 shows the three curves resulting from the above example. The point where the

two curves representing y_j = 0 and y_j = 1 intersect is θ = b_j1 = -1.0. Likewise, the curves representing y_j = 1 and y_j = 2 intersect at θ = b_j2 = 1.0. From the graphs, it is clear that as ability increases, the likelihood of a score of zero decreases. A score of one is most likely near θ = 0, and as ability increases, it is more likely that an examinee's score will equal 2.

Figure 2.2. OCC for a 3-Category Item: b_j1 = -1 and b_j2 = 1

In this example, the difficulty of the subsequent step increased. The PC model does not require this type of ordering. If the item were 8/4 + 3 = ?, the difficulty of the first step (division) would be greater than that of the second step (addition). In such a case, the OCCs would behave as shown in Figure 2.3, where b_j1 = 1.0 and b_j2 = -1.0.

Figure 2.3. OCC for a 3-Category Item: b_j1 = 1 and b_j2 = -1

The graph now reflects an increased likelihood of a score of zero: the most difficult part of the question is the first part. If an examinee can perform the first operation, then it is very likely that he can also perform the second operation. Thus, the probability of a score of one is also lower. Items are not required to have the same number of categories, m_j. If m_j = 1, then this model is equivalent to the 1PL-model. In the above examples, b_jh behaves like a step difficulty

parameter: it describes the difficulty of each step in the question. For some instruments, this interpretation may not be appropriate, as in opinion polls and surveys. Thus b_jh is called the transition location parameter (location referring to where b_jh lies on the latent ability scale). The value of b_jh reflects the relative difficulty in endorsing category h over category (h - 1) [8, p. 166].

The PC model has been generalized to incorporate a discrimination parameter. Muraki's General Partial Credit (GPC) model is developed along the same lines as the PC model but is based on the 2PL model rather than the 1PL model and sometimes includes the scaling constant D [2] [8]. The probability of scoring in category k of item j is

p(y_{jk} | θ, a_j, b_{jk}) = \frac{\exp\left( \sum_{h=0}^{k} D a_j (θ - b_{jh}) \right)}{\sum_{i=0}^{m_j} \exp\left( \sum_{h=0}^{i} D a_j (θ - b_{jh}) \right)},    (2.2.12)

with b_{j0} ≡ 0. For item j, the parameter a_j reflects the level of discrimination, and b_{jh} is the transition location or difficulty parameter between categories h - 1 and h. The number of categories is m_j, and this value can vary across items. A partial credit item with three parts is a four-category item and has possible category scores of {0, 1, 2, 3}, where a score of zero represents no credit and a score of three indicates full credit. As in the Partial Credit model, the parameter b_1 represents the point on the ability scale where an examinee has an equal chance of scoring 0 or 1, and this is the point where these two curves intersect. Similarly, the curves representing the probability of scoring 1 and of scoring 2 intersect at θ = b_2. For example, the following item taken from a college algebra exam is a four-category partial credit item, and the OCCs are shown in Figure 2.4:

GPCM Example. For f(x) = 5x and g(x) = x + 6, find the following:
a. (f ∘ g)(x) =   (Simplify your answer.)
b. (g ∘ f)(x) =   (Simplify your answer.)
c. (f ∘ g)(2) =   (Simplify your answer.)  [5]

An examinee who only answers one part of this question correctly would have a category score of y = 1 and receive partial credit corresponding to one-third of the item's total value. Answering two parts correctly results in a category score of y = 2 and earns two-thirds partial credit. Note that the b_1 parameter (the point where the curves for Category Zero and Category One intersect) has the largest value of the b_i parameters, indicating that if an examinee can correctly answer part (a), then most likely the examinee will be able to answer all parts correctly.
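To make the category probabilities concrete, the sketch below evaluates the partial credit probabilities of equations (2.2.9) to (2.2.11) and their GPC generalization from equation (2.2.12). The transition locations and discrimination used are hypothetical; setting a = 1 and D = 1 recovers the Masters partial credit model.

```r
# Sketch: category probabilities for the (generalized) partial credit model.
# 'b' holds the transition locations b_1, ..., b_m; the category-0 term is the leading 0.
gpc_probs <- function(theta, a, b, D = 1.7) {
  z <- cumsum(c(0, D * a * (theta - b)))  # cumulative sums of D * a * (theta - b_h)
  exp(z) / sum(exp(z))                    # probabilities for categories 0, 1, ..., m
}

# Partial credit model of equations (2.2.9)-(2.2.11): a = 1, D = 1, b = (-1, 1)
round(gpc_probs(theta = 0, a = 1, b = c(-1, 1), D = 1), 3)

# A hypothetical four-category GPC item such as the composition-of-functions example
round(gpc_probs(theta = 0.5, a = 0.8, b = c(-0.5, 0.1, 0.9)), 3)
```

At θ = 0 with b = (-1, 1), the middle category is the most probable, matching the behavior described for Figure 2.2.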

Figure 2.4. GPCM Example

It is important to note that when a model has multiple parameters, different parameter estimates may generate similar ICCs or OCCs. The 3PL-model is prone to this phenomenon due to the interaction between the a, b, and c parameters [55]. The item below is from a college algebra exam.

3PL Example. Begin by graphing the standard cubic function f(x) = x³. Then use transformations of this graph to graph the function given below.
h(x) = 3x³
Choose the correct graph of h(x) below. ID: [5]

Two sets of item parameter estimates were obtained using different programs: a = 0.39, b = 0.76, c = 0.25 from one, and a = 0.35, b = 1.0 (with a noticeably different c estimate) from the other. The a parameter estimates are quite similar, but the difference in the b and c parameters is noticeable. Yet the curves in Figure 2.5 are nearly identical.

Figure 2.5. Similar ICCs with Different Parameter Estimates

2.3. Parameter Invariance

Parameter estimates in IRT are invariant: Item characteristics are not dependent upon the particular population, and ability estimates are not dependent on a particular instrument or set of items. Stated another way, when an Item Response Model fits the data, the resulting Item Response Function, IRF, is the same regardless of the distribution of ability parameters [23]. Consider the following theoretical example from [23], which employs a two-parameter (2PL) model. First, the model is fitted to an entire group, and then the group is subdivided into two subgroups, one of low ability and one of high ability. If the item characteristics are invariant, then all three groups should yield the same estimates. Let

P denote the probability of a correct response to an item by an examinee with ability θ:

P = \frac{\exp(D a (θ - b))}{1 + \exp(D a (θ - b))}.    (2.3.1)

The odds of a correct response are

\frac{P}{1 - P} = \exp(D a (θ - b)).    (2.3.2)

Taking the natural log of both sides yields

\ln\left( \frac{P}{1 - P} \right) = D a (θ - b) = D a θ - D a b = αθ + β,    (2.3.3)

by letting α = Da and β = -Dab. The term ln(P/(1 - P)) is known as the log odds ratio and can be seen to be a linear function of θ with two unknowns, α and β. When this equation is interpreted as a linear function of θ and P, the slope is α and the intercept is β. Thus, if P and θ are known at two points, the values of α and β can be calculated. Since the line will be the same for all values of θ, any subset of θ values will yield the same values for α and β, which in turn yield the same values for a and b. The above assumes an exact model-data fit.

Strictly speaking, invariance will not hold under actual testing situations. However, the degree of invariance can be measured. The following is an example from [1]. Suppose a given population is divided into two subpopulations, one with θ ∈ (-3, -1) with a mean of -2 and the other with θ ∈ (1, 3) with a mean of 2. The observed proportions of correct responses are plotted for each group, and an item

characteristic curve is fitted to the data, a 2PL-model in this case. The two groups of data yield the same item parameters. The basic underlying reason for this invariance is that the ICC is fitted to the data within a specific range: the lower range is fitted to the lower left tail of the curve, while the upper range is fitted to the upper right tail of the curve. There is, however, only one curve; hence the values of a and b are the same throughout the entire range of θ. Note that this example assumes that the ability levels of the examinees are known.

In practice, there are direct methods to assess parameter invariance. One can estimate ability parameters using an entire test and then re-estimate these parameters using two random subsets of items. Ideally, a high correlation would exist between the two sets of ability estimates. A similar procedure is carried out for item parameter estimates using the whole population and using two random subsets of examinees, with the hope of a high correlation between the two sets of estimates. A lack of invariance can mean that one or more of the underlying assumptions of IRT has not been met, that an incorrect model was used, or that there were an insufficient number of examinees and/or items for accurate estimation [23].

2.4. Parameter Estimation

Parameter estimation is a critical step in item response theory. The theory would not be possible without valid techniques for estimating item parameters and ability parameters separately and jointly. The goal of these estimation techniques is to obtain the parameter

values which produce the curve that best fits the observed data. A brief overview of the traditional methods follows. This development is from [8] and [23].

Traditional IRM parameter estimation techniques rely on maximizing a likelihood function, and the most challenging situation occurs when both item and ability parameters are unknown quantities. However, to illustrate the process, the estimation of the ability parameters when the item parameters are known is presented first, using a one-parameter logistic (1PL) model. Recall that in the 1PL-model, each item has a difficulty parameter, b, and all items are assumed to discriminate equally. Thus we set a = 1 and omit it from the following calculations. Consider a test consisting of J = 5 dichotomously scored items with difficulties b_1 = -1.9, b_2 = 0.6, b_3 = 0.25, b_4 = 0.3, b_5 = 0.45 and a 1PL-model. If this test were a mathematics test, item one would be considered an easy item, and items 2-5 would have an average difficulty level. Suppose an examinee has the response pattern (1, 1, 0, 0, 0); i.e., the examinee answered items 1 and 2 correctly and items 3-5 incorrectly. The goal of ability parameter estimation is to determine which value of θ (the ability parameter) has the greatest likelihood of producing the response pattern (1, 1, 0, 0, 0). The estimation process begins with the calculation of the probability of the observed response to each item over a

49 range of θ values. Let y = (y 1, y 2,, y J ) = (1, 1, 0, 0, 0) suppose θ = 3.0. Then p(y 1 = 1 θ = 3.0, b 1 = 1.9) = p(y 2 = 1 θ = 3.0, b 2 = 0.6) = exp ( ) 1 + exp ( ) =.2516, p(y 3 = 0 θ = 3.0, b 3 = 0.25) = (2.4.1) p(y 4 = 0 θ = 3.0, b 4 = 0.3) = p(y 5 = 0 θ = 3.0, b 5 = 0.45) = Intuitively, these probabilities are reasonable for an examinee of low ability. With the assumption of local independence, the probability of the response pattern y = (1, 1, 0, 0, ) can be calculated as: 5 p(y = (1, 1, 0, 0, 0) θ = 3.0, b j ) = p(y j θ, b i ) = (2.4.2) j=1 Similar calculations are done over the range of θ, which is typically ( 3, 3). Let P j = p(y j θ, b j ) for j = 1, 2,, J. Then the likelihood of response pattern y i of examinee i with ability θ is given by L(y θ, b) = J j=1 P y j j (1 P j) 1 y j, (2.4.3) where y j = 1 for a correct response, y j = 0 for an incorrect response, and b = (b 1, b 2,, b J ) is the vector of difficulty parameters. 33
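The calculation just described is easy to reproduce numerically. The sketch below (Python with NumPy and SciPy) computes the probability of the pattern (1, 1, 0, 0, 0) under the five-item 1PL example and then maximizes the log of the likelihood over θ, anticipating the discussion that follows. The signs of the difficulty values are an assumption here, since minus signs did not survive transcription, so the printed numbers are illustrative rather than a reproduction of the values quoted in the text.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Assumed difficulties for the five-item 1PL example (signs are a guess).
b = np.array([-1.9, -0.6, 0.25, 0.3, 0.45])
y = np.array([1, 1, 0, 0, 0])          # observed response pattern

def prob_correct(theta):
    """1PL probability of a correct response (a fixed at 1)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Probability of the observed pattern at theta = -3.0 (local independence: multiply).
p = prob_correct(-3.0)
print(np.where(y == 1, p, 1 - p).prod())

# Likelihood of the pattern as a function of theta, maximized numerically.
def neg_log_likelihood(theta):
    p = prob_correct(theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(-4, 4), method="bounded")
print("maximum likelihood estimate of theta:", round(result.x, 2))
```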

Since one expects more items than in this simple example, the log likelihood function is used to avoid the multiplication of many small numbers:

\ln(L(\mathbf{y} \mid \theta, \mathbf{b})) = \sum_{j=1}^{J} \left[ y_j \ln(P_j) + (1 - y_j) \ln(1 - P_j) \right].  (2.4.4)

Maximizing the log likelihood function is equivalent to maximizing the likelihood function. In this example from [8], the log likelihood function was graphed, and a maximum was determined visually at ˆθ = . In practice, a numerical approach such as Newton-Raphson is used to determine the maximum value. The above process is called Maximum Likelihood Estimation (MLE). One disadvantage of MLE is that ability estimates cannot be obtained for those examinees who answer all items correctly or incorrectly. In the first case, there are an infinite number of values above a certain threshold for θ, say above θ = 4, that would result in the same response pattern. There is no way of determining which value is most likely, and no value for ˆθ is obtained. In other words, the log likelihood function has no unique maximum value. The situation is similar for an examinee who answers all items incorrectly. A more challenging situation arises when item parameters and ability parameters are both unknown. A joint likelihood function is required, and this technique is known as Joint Maximum Likelihood Estimation, JMLE. To see the process, assume that a dichotomously scored test with J items is administered to N examinees. Let y = (y_1, y_2, \dots, y_J) represent a particular examinee's response pattern. With the assumption of local independence, the

51 probability that the examinee has this response pattern is the product p(y θ, b) = J j=1 P y j j (1 P j) 1 y j (2.4.5) where P j represents the probability for item j calculated according the to the chosen model. Let P nj, Q nj represent P nj, (1 P nj ) respectively for examinee n = 1, 2, 3,, N, and let θ = (θ 1, θ 2,, θ N ) be the vector of ability parameters for each examinee. Take the above equation and multiply it across N examinees to obtain L(y 1, y 2,, y N θ, b) = N J n=1 j=1 P y nj nj Q1 y nj nj. (2.4.6) To avoid multiplying many small numbers, the log likelihood function is used: N J ln L(y 1, y 2,, y N θ, b) = [y nj ln P nj + (1 y nj ) ln Q nj ]. (2.4.7) n=1 j=1 The goal is to find the values of θ n for each examinee and b j for each item that maximize this log likelihood function. Conceptually, the process is an iterative one consisting of steps which are repeated until the desired level of convergence is reached. For a 1PL model, the process begins by estimating the item s difficulty parameter, b j, using some initial value of the population s ability parameters (typically based on the ratio of correct to incorrect responses). The reason one begins by estimating item parameters rather than the ability parameters is that, in general, there are more examinees than items. Hence the data provides more information regarding the item s difficulty than the examinees ability levels [8]. Items are treated individually, and the estimated parameter values obtained are then treated as 35

52 the known item parameters and used to estimate each examinee s ability parameter. These ability parameter estimates are then used to obtain improved item parameter estimates. The steps are repeated until the desired level of convergence is obtained. Several observations are noteworthy. The metric or scale used is indeterminate. For example, in the 1PL model, the probability of an examinee s correct response to item j is p j (θ) = exp (θ b j) 1 + exp (θ b j ). (2.4.8) Since this probability depends on the difference between θ and b j, different values of each parameter could yield a difference of 1, say, and so would yield the same probability. Thus the metric employed is relative and must be fixed. Both item parameters and the ability parameter are on the same scale, and typically, one fixes the metric by either person centering or item centering [8, p. 41]. If person centering is used, the mean of the estimated ability parameters is set to zero after each step. If item centering is used, the mean of the item parameter estimates is set to zero after each estimation step. Another consideration is the actual number of parameter estimations required. If the 3PL model is used for a test with J items administered to N examinees, then N + 3J parameter estimates are required. Better estimates are obtained by increasing the sample size, but this increases the number of ability parameter estimates required without yielding any additional information about these ability parameters. In addition, often during the estimation process, items are found that do not fit the model. In JMLE, the item parameter estimation process is linked to the ability parameter estimation process, thus, removing the misfitting items requires restarting the process. 36
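The indeterminacy of the metric described above is easy to see numerically: shifting every θ and every b by the same constant leaves all 1PL probabilities unchanged, and item centering removes the ambiguity. The short sketch below uses hypothetical ability and difficulty values chosen only for illustration.

```python
import numpy as np

theta = np.array([-1.0, 0.2, 1.5])       # hypothetical abilities
b = np.array([-0.5, 0.0, 0.8, 1.2])      # hypothetical difficulties

def p_1pl(theta, b):
    """1PL probabilities for every (examinee, item) pair."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

shift = 2.7
print(np.allclose(p_1pl(theta, b), p_1pl(theta + shift, b + shift)))  # True

# Item centering: fix the metric by forcing the mean difficulty to zero.
b_centered = b - b.mean()
theta_centered = theta - b.mean()
print(np.allclose(p_1pl(theta, b), p_1pl(theta_centered, b_centered)))  # True
```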

53 JMLE does not provide ability estimates for examinees with perfect scores (or with scores of zero) and does not yield item parameter estimates for items that are answered correctly by all respondents. In 2PL and 3PL models, the estimates have been shown to be inconsistent. Baker states, The marriage of the three-parameter ICC model and the JMLE procedure has not been a happy one [2, p. 107]. For the 3PL model, initial estimates must be very close to actual values, else it is likely that the Newton-Raphson equations will diverge, i.e, the estimation procedure fails. (This is not the case for the 2-PL model however.) Baker goes on to say that the inclusion of the pseudo guessing parameter c does considerable violence to the mathematics underlying the estimation process [2, p.108]. In addition, the JMLE procedure requires large numbers of examinees (1000 or more) and large numbers of items (60 or more) [2]. In the 3PL model, the JMLE procedure may fail without restrictions being place on the c parameter values. In particular, there exists the problem of finding no unique maximum value, especially when there is a lack of data in the lower ability region [8]. In general, estimation of the guessing parameter, c, is difficult without some prior information, and this difficulty influences the estimation of the other parameters. Problems exist associated with the ability parameter estimates in JMLE as well. The following example from de Ayala [8, p. 129] illustrates one issue. Suppose an instrument has only two items where Item 1 has discrimination of a 1 = 2.0, a difficulty level of b 1 = 0.0, and c 1 = 0.25, and for Item 2, a 2 = 1.0, b 2 = 0.5, and c 2 = 0.0. A response pattern of (1, 0) would mean that the examinee answered the more difficult, more discriminating item correctly but answered the easier, less discriminating item incorrectly. This situation would 37

54 result in a likelihood function having a local maximum at θ = 0.5, but as θ approaches -4, the function has an asymptotic value of 0.25 which is greater than the local maximum at θ = 0.5. No unique ability estimate is possible. Thus, with JMLE, it is possible to have local maxima and no global maximum. In general, JMLE is not used for 3PL models [8]. A method that addresses some of the difficulties of JMLE is Marginal Maximum Likelihood Estimation, MMLE. In MMLE, the ability parameter is integrated out, and estimates of the item parameters are obtained separately using a marginal likelihood function. Once these estimates are obtained, ability parameters are estimated treating the item parameters as known. This process requires designating an approximate distribution for the population s ability parameters and as such, requires a high number of examinees. A popular method of obtaining marginal maximum likelihood estimates of parameters is the EM (Expectation-Maximization) algorithm. Many commercial programs, such as BILOG, PARSCALE, and Xcalibre (formerly MULTILOG), and the freely available program ICL obtain parameter estimates via some variation of this algorithm [55] [27] [37]. These estimates can be viewed as marginal maximum likelihood estimates, as one of the likelihood functions that this algorithm maximizes is derived from the marginal distribution of the observed responses [22]. The EM algorithm uses both the observed data likelihood and the complete data likelihood. A detailed description of this algorithm as it is used in ICL is provided in Section

2.5. The Bayesian Approach

The Bayesian approach to Item Response Models solves many of the issues of traditional maximum likelihood estimation while yielding virtually the same results as these traditional methods when little prior information is known. Thus, the Bayesian approach will yield estimates as good as, and often better than, the more traditional methods. Further, as models attempt to accommodate more random effects, such as multidimensional traits or random item effects, the traditional models are no longer practical [7]. We begin with a brief overview of the Bayesian approach to probability. Since the likelihood function plays a key role here as well as in the traditional methods, some details regarding this function and the likelihood principle are provided. Bayes Theorem provides a rule from which one can obtain inverse inferences. Suppose A and B are two possible outcomes and A = A_1 \cup A_2 \cup \cdots \cup A_n with A_i \cap A_j = \emptyset when i \neq j. For P(B) \neq 0, Bayes Theorem yields

P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{P(B)} = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{i=1}^{n} P(B \mid A_i)\, P(A_i)}.  (2.5.1)

If we think of B as the observed outcome and A as all the possible causes of B, then P(A_i \mid B) can be interpreted as the probability that A_i caused B to occur. The above is sometimes written in the more general form:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \propto P(B \mid A)\, P(A),  (2.5.2)

for any two events A and B [38].
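A small numerical illustration of the inverse inference in (2.5.1), with made-up priors and likelihoods for three mutually exclusive causes, is sketched below; the numbers are hypothetical and chosen only to show the mechanics.

```python
# Hypothetical example: three mutually exclusive causes A_1, A_2, A_3 of an outcome B.
prior = [0.5, 0.3, 0.2]          # P(A_i)
like = [0.10, 0.40, 0.80]        # P(B | A_i)

evidence = sum(p * l for p, l in zip(prior, like))            # P(B)
posterior = [p * l / evidence for p, l in zip(prior, like)]   # P(A_i | B), eq. (2.5.1)
print(posterior, sum(posterior))  # posterior probabilities sum to one
```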

56 Given a set of observations, y 1, y 2,, y n and an unknown parameter θ, the likelihood function, L(θ; y 1, y 2,, y n ), is the joint probability density (or mass) function of the observations viewed as a function of the parameter θ. Mathematically, the likelihood function is equal, up to a multiplicative constant, to the joint probability density f(y 1, y 2,, y n θ), which is a function of the data conditioned on a given θ. We will see that the multiplicative constant plays no role when maximizing the likelihood function [41]. Of particular importance is the following: Likelihood Principle The information brought by an observation x about θ is entirely contained in the likelihood function l(θ x). Moreover, if x 1 and x 2 are two observations depending on the same parameter θ, such that there exists a constant c satisfying l 1 (θ x 1 ) = cl 2 (θ x 2 ) for every θ, they bring the same information about θ and must lead to identical inferences [45, p. 15]. The likelihood principle implies all the information about θ contained in the data from an experiment that is required for inference about the unknown quantity θ is available from the likelihood function for this data [41, p. 36]. Thus, if two experiments have likelihood functions which are proportional, the inferences about θ must be the same. Press [41, p. 36] states the following as a corollary to the Likelihood Principle: all evidence about θ from an experiment and its observed outcome should be present in the likelihood function. The likelihood principle restricts all inferences to those that can be made based on the actual observations. 40
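The likelihood principle can also be checked mechanically: rescaling a likelihood by any positive constant changes neither the maximizing value of θ nor the normalized posterior. The sketch below uses an arbitrary (hypothetical) likelihood and prior purely for illustration.

```python
import numpy as np

theta = np.linspace(-4, 4, 801)
like = np.exp(-0.5 * (theta - 0.7) ** 2)     # a likelihood in theta (hypothetical)
like_scaled = 3.7 * like                     # proportional likelihood, c = 3.7

prior = np.exp(-0.5 * theta ** 2)            # an arbitrary prior on the same grid

def posterior(l):
    p = l * prior
    return p / p.sum()

print(np.allclose(posterior(like), posterior(like_scaled)))      # True: same inference
print(theta[np.argmax(like)] == theta[np.argmax(like_scaled)])   # True: same maximizer
```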

Bayes theorem has the following functional form

h(\theta \mid y_1, y_2, \dots, y_n) = \frac{f(y_1, y_2, \dots, y_n \mid \theta)\, g(\theta)}{\int f(y_1, y_2, \dots, y_n \mid \theta)\, g(\theta)\, d\theta}.  (2.5.3)

Here, θ is a continuous unknown quantity about which we would like to make inferences, and h is the probability density of θ obtained after observing the data (y_1, y_2, \dots, y_n) which influence θ. The function h is called the posterior density function, f is the joint pdf of the data, and g(θ) is the prior probability density of θ. Since the likelihood function of θ is proportional to f,

h(\theta \mid y_1, y_2, \dots, y_n) = \frac{c\, L(\theta; y_1, y_2, \dots, y_n)\, g(\theta)}{\int c\, L(\theta; y_1, y_2, \dots, y_n)\, g(\theta)\, d\theta}.  (2.5.4)

Note that the proportionality constant cancels out. The expression in the denominator represents the probability density of y = (y_1, y_2, \dots, y_n). Suppose y = (y_1, y_2, \dots, y_k) represents the response pattern of examinee i on a test measuring some ability θ. Let p(θ) represent the prior distribution of θ, and let p(y) be the probability function of y. Bayes theorem becomes

p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta),  (2.5.5)

and since all Bayesian inference is made from the posterior distribution, Bayesian inference depends on the data only through the likelihood function [41, p. 37]. Obtaining the

correct likelihood function is critical; in particular, caution must be used in choosing the prior distribution. A prior distribution for a parameter reflects one's prior knowledge of the parameter, which can come from a variety of sources. For example, item parameter characteristics and/or test data may be available from a previous test. The researcher may also have his or her own predictions about the item parameters. Provided the researcher has the required mathematical background, it is reasonable to expect some valid prior estimates regarding the item parameters on a math test to be possible. The choice of a prior distribution is, for the most part, a subjective one. An objective prior would be one that assumed all values for a particular parameter to be equally likely, for example, a uniform density on the real line. Objective priors are often improper, meaning they do not integrate to one, and there are several disadvantages to such a choice. Both [15] and [41] state that improper priors often lead to improper posterior distributions. In reality, something is always known about the parameters in question, and it is always possible to specify a vague or uninformative proper prior. A proper prior that is nearly uniform can be obtained by using a four parameter beta distribution b(α, β, min, max) given by

f(x) = \frac{(x - \min)^{\alpha - 1}\, (\max - x)^{\beta - 1}}{B(\alpha, \beta)\, (\max - \min)^{\alpha + \beta - 1}}.  (2.5.6)

Setting α = β = 1.01, min = −6.0, and max = 6.0 yields a distribution that is nearly uniform on the interval (−6, 6). This distribution is an option of some software programs and is flexible in terms of its shape, which can be altered by adjusting the parameters of the distribution.
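The four-parameter beta in (2.5.6) is an ordinary beta distribution rescaled to the interval (min, max), so it can be evaluated with SciPy's beta distribution by setting loc = min and scale = max − min. The minimal sketch below checks that with α = β = 1.01 the density is essentially flat over (−6, 6) and still integrates to one, so it is a proper, nearly uniform prior.

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

alpha_p, beta_p = 1.01, 1.01
lo, hi = -6.0, 6.0
prior = beta(alpha_p, beta_p, loc=lo, scale=hi - lo)   # four-parameter beta on (lo, hi)

x = np.linspace(-5.9, 5.9, 7)
print(prior.pdf(x))                  # nearly constant, roughly 1/12 everywhere
print(quad(prior.pdf, lo, hi)[0])    # about 1.0, so the prior is proper
```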

59 All prior distributions have certain parameters that must be specified. Suppose a researcher suspects that the examinees being tested represent a sample from a known population and that examinees are sampled independently from this population. Then it would be reasonable to specify θ N (µ θ, σθ 2 ) as the prior distribution of θ. Note that now the mean and variance are also parameters which must be specified these are called hyper-parameters. The situation is similar for each estimated parameter: if prior distributions are specified for a parameter, then these distributions are characterized by hyper-parameters, which must also be specified. The choice of prior distribution determines the level of effect on the posterior distribution, and one should make this choice cautiously, since Bayesian parameter estimates are biased the estimates are pulled toward the mean of the prior distribution. The size of the effect is determined by the choice of prior hyper-parameters and by the amount of data available. For example, if the variance of the prior distribution is small, the resulting parameter estimates will have less variation. When there are a large number of observations, the influence of the prior will be reduced [15]. One suggestion, when feasible, is to obtain non- Bayesian parameter estimates via maximum likelihood techniques and use these estimates as a guideline when choosing the prior distributions [2]. There are techniques, known as Bayes hierarchal techniques, that allow prior distributions to be set for the hyper-parameters. This approach potentially requires the integration of high-dimensional integrals. However, there now exist computer simulation techniques, MCMC (Markov chain Monte Carlo) methods, that make this hierarchal Bayes approach 43

60 possible in some circumstances. Software programs like WinBUGS can be used for this approach [34]. However, it is not always feasible, in particular when a test is mixed format, when there is large amount of data, and/or when the models are complex. In these situations, MCMC techniques can be slow or might not produce results. The 3PL model is not that complex, but the c parameter estimates can be difficult to obtain. There are programs available in R that can be used for this type of analysis, and their development is ongoing [15]. Two Bayesian strategies that are popular and that can be used with mixed format tests are Bayes modal, or MAP (maximum a posteriori), and EAP (expected a posteriori) estimation. Both incorporate prior distributions for parameters. Bayes modal estimates are found by maximizing the posterior distribution (finding the mode), and EAP estimates are obtained from the expected value (or mean) of the posterior. The EM algorithm can be modified to obtain MAP estimates (see Section 2.5.1). The process is an iterative one and is often used for item parameter estimation. There are some advantages to obtaining EAP rather than MAP ability estimates, given known or previously estimated item parameters. The process is usually faster, as it is not an iterative approach (see Section 2.5.2), and the prior mean has been shown to have a bigger effect on MAP ability estimates. In addition, EAP ability estimates have smaller average squared errors than those of MAP [8]. A summary the main advantages of the Bayesian approach over traditional maximum likelihood methods from [15] follows. 44

61 When meaningful prior distributions can be specified (due to previous information or from the data itself), Bayesian estimates will be superior to marginal maximum likelihood estimates. Since parameters are estimated using prior distributions and response data, the estimate will have a smaller standard error than that of MMLE. Incorporating the prior distribution restricts item and ability parameter estimates to meaningful ranges. It is possible to obtain estimates for examinees with perfect scores and aberrant response patterns, as well as for items which are answered correctly by all examinees. A Bayesian estimation procedure is more appropriate for moderate and smaller sample sizes since it does not rely on large-sample asymptotic results like the marginal maximum likelihood procedure [15, p. 70]. When the sample sizes are very large, the two approaches yield comparable results, since a large amount of data means the posterior density is influenced more by the likelihood and less by the prior. The posterior density obtained is sufficient for most statistical inferences, including parameter estimates and confidence intervals, whereas in MMLE, obtaining estimates and related confidence intervals are two different (computational) problems [15, p. 70]. The following two subsections provide some details of MAP and EAP parameter estimation. 45
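Before those details, the distinction between the two estimators can be previewed on a grid: the Bayes modal (MAP) estimate is the point where a discretized posterior is largest, while the EAP estimate is its weighted mean. The sketch below uses a deliberately skewed, hypothetical posterior so the two values visibly differ.

```python
import numpy as np
from scipy.stats import norm

theta = np.linspace(-4, 4, 401)
# A deliberately skewed (hypothetical) posterior density evaluated on the grid.
post = norm.pdf(theta, loc=0.0, scale=1.0) * (1 + 0.9 * np.tanh(theta))
post /= post.sum()                        # normalize over the grid

map_est = theta[np.argmax(post)]          # Bayes modal (MAP): the mode
eap_est = np.sum(theta * post)            # EAP: the posterior mean
print(map_est, eap_est)                   # the two estimates differ for a skewed posterior
```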

62 MAP Estimates of Item Parameters. The EM algorithm can be used to obtain maximum likelihood or Bayesian modal parameter estimates. Bradley Hanson wrote several papers regarding IRT and parameter estimation via the EM algorithm as it is implemented in the IRT software ICL [24] [25]. If there are N examinees and J dichotomous items, then the observed data is an N J matrix Y = (y 1, y 2,, y N ) t where y i = (y i1, y i2, y ij ) represents the response pattern of examinee i. In the case where each item is dichotomous, each entry y ij equals zero or one. The observed data together with the unknown parameters comprise what is known as the complete data, [(y 1, θ 1 ), (y 2, θ 2 ),, (y N, θ N )], which represents each examinees response pattern and ability level, θ [25]. One of the keys to simplifying the EM algorithm is letting θ be discrete, even though in the actual model, θ is continuous. Hanson states (t)his results in exactly the same algorithm as is obtained by deriving estimation procedures based on a continuous latent variable and then implementing approximations of those procedures with a discrete version of the continuous latent variable using numerical integration... [25, p. 1]. Use K + 1 points to divide the range of θ into K evenly spaced intervals, labeling each interval as q k. Each value of q k for k = 1, 2,, K has an associated probability, and we let π = (π 1, π 2,, π K ) denote these probabilities. Thus, the latent or ability random variable, Θ, has a probability distribution that [53] shows can be expressed as P (Θ = q k π) = P (Θ = q k π k ) = p(q k π) = p(q k π k ) = π k. (2.5.7) 46

63 Let represent the matrix of item parameters. In the case of a mixture of 2PL and 3PL items, would be a 3 J matrix where each row represents one of the parameters a, b, or c with c = 0 for 2PL items. Let y = (y 1, y 2,, y J ) represent the response pattern for one examinee to one item, where π represents the probability distribution for Θ. Then [53, p. 3] shows that the probability of this response pattern y is given by f(y, π) = = = = K f(y, q k, π) k=1 K f(y q k,, π)p(q k, π) k=1 K f(y q k, )p(q k π) k=1 K f(y q k, )π k. k=1 (2.5.8) The above calculation makes use of the assumptions of IRT models. First, when conditioned on the latent variable, the probability of a correct response to an item does not depend on the probability distribution of the population, π, and so f(y q k,, π) = f(y q k, ). Secondly, p(q k, π) = p(q k π) = π k, since the latent variable distribution does not depend on the item parameters. In Item Response Theory, it is also assumed that the responses to items are independent (when conditioned on the latent variable). Let δ j be the parameters of item j, and let P (q k δ j ) represent probability of a correct response to item j by an examinee with ability q k. For J items, the probability of response pattern y by an examinee with ability 47

64 level q k is, by independence, J f(y q k, ) = P (q k δ j ) y j [1 P (q k δ j )] 1 y j. (2.5.9) j=1 Substitute (2.5.9) into the last line of (2.5.8), to obtain the likelihood function for the observed data from N examinees and J items: L(Y, π) = ( N K π k i=1 k=1 j=1 ) J P (q k δ j ) y ij [1 P (q k δ j )] 1 y ij (2.5.10) Note that the above likelihood is based on the marginal distribution of the observed data. Hence, it is sometimes referred to as a marginal likelihood function. The parameters that need to be estimated are and π, and the EM algorithm simplifies this process by making use of both the observed data likelihood and the complete data likelihood. Hanson [25] derives the complete data likelihood in two different ways. One method is presented next. Let q 1, q 2,, q K be the K possible values of the discrete ability variable, θ, and suppose there are J items on the test. In each ability category, q k, with k = 1, 2,, K, there are 2 J possible response patterns. Thus, there are 2 J K possible outcomes to consider, each of which represents a particular response pattern and ability category. Let n k be the number of examinees in ability category k, and let r jk be the number of examinees in category k with a correct response to item j. Set m yqk equal to the number of examinees in ability category q k for each k with response pattern y. These counts will totally represent the complete data and have a multinomial distribution. 48
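Before moving to the complete-data formulation, the marginal (observed-data) likelihood in (2.5.10) can be evaluated directly by summing the conditional pattern probabilities over the discrete ability points. The sketch below does this for a tiny, hypothetical 2PL data set; the item parameters, response matrix, and normal weights are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 2PL item parameters and a small response matrix Y (N = 3, J = 3).
a = np.array([1.2, 0.7, 1.0])
b = np.array([-0.5, 0.3, 1.0])
D = 1.7
Y = np.array([[1, 1, 0],
              [0, 1, 0],
              [1, 1, 1]])

q = np.linspace(-4, 4, 41)            # discrete ability points q_k
pi = norm.pdf(q)
pi /= pi.sum()                        # pi_k, the latent distribution weights

P = 1.0 / (1.0 + np.exp(-D * a * (q[:, None] - b)))    # K x J probabilities

def log_marginal_likelihood(Y, P, pi):
    """Equation (2.5.10): for each examinee, sum pi_k * f(y | q_k) over the grid."""
    total = 0.0
    for y in Y:
        cond = np.prod(np.where(y == 1, P, 1 - P), axis=1)   # f(y | q_k), eq. (2.5.9)
        total += np.log(np.sum(pi * cond))
    return total

print(log_marginal_likelihood(Y, P, pi))
```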

65 The probability of m yqk depends upon and π, and [25] writes f(m yqk, π) = f(m y q k,, n k )f(n k π). (2.5.11) The distribution of the number of examinees in ability category q k is f(n k π) is a multinomial distribution with L(n 1, n 2,, n K π) = K k=1 π n k k (2.5.12) as its likelihood. The function f(m y q k,, n k ) represents the distribution of the number of examinees with response pattern y and is also a multinomial distribution. To obtain the likelihood function, [25] notes that this distribution can be thought of as a product of J binomial distributions. For example, in a particular category k, we can view n k as the number of trials and r jk as the number of successes (the number of examinees who answer item j correctly), with the probability of success being given by the Item Response Model in use, P (q k δ j ). Given the counts r k = r 1k, r 2k,, r Jk, we obtain the likelihood for category k J L(r 1k, r 2k,, r Jk, n 1, n 2,, n K ) = P (q k δ j ) r jk [1 P (q k δ j )] n k r jk. (2.5.13) j=1 Taking the above equation over all the categories yields the likelihood function associated with f(m y q k,, n k ): K J L(r 1, r 2,, r K, n 1, n 2,, n K ) = P (q k δ j ) r jk [1 P (q k δ j )] n k r jk. (2.5.14) k=1 j=1 49

66 It follows that the complete data likelihood is the product of (2.5.14) and (2.5.12) L(r 1, r 2,, r K, n 1, n 2,, n K, π) = J K j=1 k=1 P (q k δ j ) r jk [1 P (q k δ j )] n k r jk π n k k. (2.5.15) Hanson states, (t)he counts r 1, r 2,, r K and n 1, n 2,, n K are the complete data sufficient statistics for the parameters and π [25, p.3]. Of course, the log likelihood is easier to work with. Set R = (r 1, r 2,, r K ) and n = (n 1, n 2,, n K ). The the log likelihood can be written log[l(r, n, π)] = J K r jk log[p (q k δ j )] + (n k r jk ) log[1 P (q k δ j )] + n k log[π k ]. j=1 k=1 (2.5.16) Equation (2.5.16) is the likelihood that is maximized by the EM algorithm. This maximization is made simpler by splitting the right hand side into two equations: J K log[l(r, n, π)] = r jk log[p (q k δ j )] + (n k r jk ) log[1 P (q k δ j )] j=1 k=1 K + n k log[π k ]. k=1 (2.5.17) The EM algorithm is iterative and consists of two steps: the E step and the M step. The following development is based on [24] and [25]. The E step provides estimates r (s) jk and n(s) k for steps s = 0, 1, 2,. The estimates from the E step are substituted into the right hand side of (2.5.17), and the M step maximizes these two equations (in two steps) to obtain the 50

67 estimates π (s+1) k and δ (s+1) j. Let l(δ j ) = K r jk log[p (q k δ j )] + (n k r jk ) log[1 P (q k δ j )], (2.5.18) k=1 and let then (2.5.17) can be written as l(π) = K n k log[π k ], (2.5.19) k=1 log[l(r, n, π)] = J l(δ j ) + l(π). (2.5.20) j=1 First, the values of π (s+1) k are obtained by maximizing (2.5.19). Since l(π) is the log likelihood of a multinomial distribution, the maximum likelihood estimate is given by n k. Given n(s) k N from the E step in iteration s, the M step calculates π (s+1) k = n(s) k N. (2.5.21) The values of δ (s+1) j are obtained in the second step (of the M step). The level of complexity of this calculation depends on the model employed, i.e., on the number of parameter estimates needed. Let t = 1, 2,, T j denote the parameters for item j. To calculate δ (s+1) j, take the derivative of l(δ j ) with respect to δ tj and set it equal to zero: l(δ j ) δ tj = K k=1 r (s) jk n(s) k P (q k δ j ) P (q k δ j ) = 0. (2.5.22) [1 P (q k δ j )]P (q k δ j ) δ tj 51

68 A system of equations is obtained that can be solved by numerical methods such as Newton- Raphson [25]. Given estimates (s) and π (s) obtained from the s 1 M step (when s = 0, initial estimates are used for 0 and π 0 ), the estimates for n (s) k and r (s) jk are found in the E step. Recall that n k denotes the number of examinees in ability category k, and r jk denotes the number of examinees in ability category k who answered item j correctly. These estimates are obtained by calculating expected values over f(q k y i, (s), π (s) ), (2.5.23) which is the conditional probability that examinee i has ability level θ = q k given the observed responses, y, (s) and π (s). Note that from Bayes theorem, we can write (2.5.23) as f(q k y i, (s), π (s) ) = f(y i q k, (s) )p(q k ) (s) K k=1 f(y i q k, (s) )p(q k ) (s) f(y = i q k, (s) )π (s) k K k=1 f(y. i q k, (s) )π (s) k (2.5.24) Thus n (s) k ) = E (n k Y, (s), π (s) = N f(y i q k, (s) )π (s) k K k=1 f(y. (2.5.25) i q k, (s) )π (s) k i=1 Substituting (2.5.9) into the above equation yields n (s) k = N i=1 π (s) k K k=1 π(s) k J j=1 P (q k δ (s) j ) y ij [1 P (q k δ (s) j J j=1 P (q k δ (s) j ) y ij [1 P (qk δ (s) j )] 1 y ij )] 1 y ij. (2.5.26) 52

69 Similarly, r (s) jk = E ( r jk Y, (s), π (s) ) = = = N y ij f(q k y i, (s), π (s) ) i=1 N i=1 N i=1 y ij f(y i q k, (s) )π (s) k K k=1 f(y i q k, (s) )π (s) k J j=1 P (q k δ (s) y ij π (s) k K k=1 π(s) k J j=1 P (q k δ (s) j j ) y ij [1 P (q k δ (s) ) y ij [1 P (qk δ (s) j j )] 1 y ij )] 1 y ij, (2.5.27) where y ij equals zero when an item is answered incorrectly and equals one when an item is answered correctly. The above procedure is used to obtain maximum likelihood estimates of parameters. The convergence criterion is the relative difference in parameter estimates after each iteration. Although the development presented involves solely dichotomous items, estimates for polytomous items are also possible, in particular, for the GPC model [25]. Given m j response categories (0, 1,, m j 1) for item j, the probability that an examinee answers in a particular response category, m i, i = 0, 1,, m j 1 given an ability category q k is given by the item response model in use: P i (q k δ j ). Hanson shows in [25, p. 9] that the complete data likelihood is K J k=1 j=1 m j 1 i=0 K log[p i (q k, δ j )]r (s) jki + n k log[π k ], (2.5.28) k=1 where r (s) jki represents the number of examinees with ability q k who answer in category i of item j. In [53], Hanson shows that m j 1 n (s) k = r (s) jki. (2.5.29) i=0 53
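Returning to the dichotomous case for a moment, the E-step quantities in (2.5.24) through (2.5.27) amount to computing each examinee's posterior weight at every quadrature point and accumulating expected counts, after which the M-step update of π in (2.5.21) is immediate. A minimal sketch, with hypothetical current parameter values:

```python
import numpy as np
from scipy.stats import norm

# Current (iteration s) item parameters and weights; all values are hypothetical.
a = np.array([1.2, 0.7, 1.0])
b = np.array([-0.5, 0.3, 1.0])
D = 1.7
Y = np.array([[1, 1, 0], [0, 1, 0], [1, 1, 1], [0, 0, 1]])   # N x J responses

q = np.linspace(-4, 4, 41)
pi = norm.pdf(q)
pi /= pi.sum()
P = 1.0 / (1.0 + np.exp(-D * a * (q[:, None] - b)))          # K x J

# Posterior weight of each quadrature point for each examinee, eq. (2.5.24).
cond = np.array([np.prod(np.where(y == 1, P, 1 - P), axis=1) for y in Y])  # N x K
post = cond * pi
post /= post.sum(axis=1, keepdims=True)

n_k = post.sum(axis=0)       # expected examinees per ability category, eq. (2.5.25)
r_jk = post.T @ Y            # expected correct responses per category and item, eq. (2.5.27)
pi_new = n_k / len(Y)        # M-step update of the latent distribution, eq. (2.5.21)

print(n_k.sum())             # equals N
print(r_jk.shape)            # (K, J)
```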

70 Thus in the E step, the value of r (s) jki is obtained and then used in M step to find values that maximize (2.5.28), with the constraint that for each item j = 1, 2,, J and in each ability category k = 1, 2,, K we must have m j 1 i=0 P i (q k, δ j ) = 1. (2.5.30) Bayes modal estimates are found by maximizing the marginal posterior density (or equivalently, its logarithm). The EM algorithm can be modified to obtain Bayes modal estimates of item parameters by incorporating prior distributions for each type of item parameter. Let g(δ tj ) be the prior distribution for the item parameter δ tj where t = 1, 2, T j, and T j represents the number of parameters characterizing item j. Parameter estimates are found that maximize the complete data posterior distribution. Letting l(δ j ) and l(π) be as above, [25] shows that log[l(r (s), n (s), π)] = T J J j l(δ j ) + log[g(δ tj )] + l(π) (2.5.31) j=1 j=1 t=1 is the logarithm of the complete data posterior. The EM algorithm is the same as before except in part two of the M step, the system of equations to be solved is given by log[g(δ tj )] δ tj + K k=1 r (s) jk n(s) k P (q k δ j ) P (q k δ j ) = 0. (2.5.32) [1 P (q k δ j )]P (q k δ j ) δ tj 54

71 EAP Estimation of Ability Parameters. The term EAP is an abbreviation for expected a posteriori, where a posteriori refers to the posterior density, and expected refers to expected value. Ability parameter estimates are found by calculating the expected value of the posterior density of θ for each examinee, using the known item parameters. As in the previous section, the ability parameter is treated as discrete. The range of θ is ( 4, 4) and is divided into K regions using K + 1 points labeled q k for k = 1, 2, K. The prior distribution of θ is given by π = (π 1, π 2,, π K ), where π k is the probability associated with q k. Suppose θ N(0, 1), and let z(θ k ) represent the height of the normal curve at z = q k. If K = 41, then π k = z(q k ) 8 40 for each k = 1, 2,, K. These points are called quadrature points, and their corresponding probabilities are quadrature weights. The number of quadrature points used is determined by the researcher. Generally, K 20, with a recommendation of using at least 2 J, where J is the number of items, when it suspected that the latent distribution is skewed [8]. The following was adapted from [51]. Using the same notation as used in the previous section, the posterior density for a given examinee s ability can be written as f(θ y, ) = πf(y θ, ) K k=1 π kf(y θ, ) (2.5.33) where y represents one examinee s response pattern to the J items on a test, and represents the previously estimated or known item parameters. The EAP estimate of ability is found by calculating E(θ y, ) = K q k f(q k y, ). (2.5.34) k=1 55
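A direct numerical version of (2.5.33) and (2.5.34), with the remaining pieces spelled out in the paragraphs that follow, might look like the sketch below. The item parameters and responses are hypothetical, and a standard normal prior supplies the quadrature weights.

```python
import numpy as np
from scipy.stats import norm

# Known (previously estimated) 2PL item parameters and a response matrix, both hypothetical.
a = np.array([1.1, 0.8, 1.3, 0.9])
b = np.array([-1.0, -0.2, 0.4, 1.1])
D = 1.7
Y = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 0, 0]])

q = np.linspace(-4, 4, 41)                      # quadrature points q_k
w = norm.pdf(q)
w /= w.sum()                                    # quadrature weights

P = 1.0 / (1.0 + np.exp(-D * a * (q[:, None] - b)))                         # K x J
cond = np.array([np.prod(np.where(y == 1, P, 1 - P), axis=1) for y in Y])   # f(y | q_k)

post = cond * w
post /= post.sum(axis=1, keepdims=True)         # f(q_k | y), the discretized posterior
eap = post @ q                                  # expected a posteriori, eq. (2.5.34)
print(eap)
```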

72 The values of q k are known and are based on the choice of the subdivision of the range of θ. For each k, the second term of the sum in (2.5.34) is f(q k y, ) = f(q k )f(y q k, ) K k=1 f(q k)f(y q k, ). (2.5.35) Recall that f(q k ) = π k and is determined by the choice of the ability prior distribution. The value of f(y q k, ) represents the probability of an examinee with ability level q k having response pattern y. This value is given by the model employed. Letting P represent the probability determined by the model, J f(y q k, ) = P (y j, q k ) y j [1 P (y j, q k )] (1 yj), (2.5.36) j=1 where J is the number of items, and y j is equal to 0 or 1. In ICL, the E step of the EM algorithm is used to calculate the posterior density of the ability parameter for each examinee, from which the expected value, or mean, is obtained. Unlike the Bayes modal estimation of item parameters, this process is not an iterative one, and the above calculations are straightforward and handled quickly via computers Information Functions An advantage of Item Response Theory over Classical Test Theory is the availability of item and test information functions [23]. These functions arise naturally from the Fisher Information, I(θ), which is typically developed, as described below, in the context of Maximum Likelihood Methods [28]. Let X be a random variable with pdf f(x; θ), and let θ Ω with Ω an open interval. There are regularity conditions that must be met that ensure that 56

θ is not an endpoint of the support of f and that integration and differentiation can be interchanged when differentiating with respect to θ [28, pp. 313, 319]. To derive the Fisher Information, first note that since

1 = \int_{-\infty}^{+\infty} f(x; \theta)\, dx,

taking the derivative with respect to θ yields

0 = \int_{-\infty}^{+\infty} \frac{\partial f(x; \theta)}{\partial \theta}\, dx.

In Ω, we have f(x; θ) > 0, and multiplying by \frac{f(x; \theta)}{f(x; \theta)} yields

0 = \int_{\Omega} \frac{\partial \ln f(x; \theta)}{\partial \theta}\, f(x; \theta)\, dx,  (2.6.1)

since

\frac{\partial \ln f(x; \theta)}{\partial \theta} = \frac{1}{f(x; \theta)} \frac{\partial f(x; \theta)}{\partial \theta}.

This implies

E\left(\frac{\partial \ln f(x; \theta)}{\partial \theta}\right) = 0.  (2.6.2)

Differentiating (2.6.1) again with respect to θ yields

0 = \int_{\Omega} \frac{\partial^2 \ln f(x; \theta)}{\partial \theta^2}\, f(x; \theta)\, dx + \int_{\Omega} \frac{\partial \ln f(x; \theta)}{\partial \theta}\, \frac{\partial \ln f(x; \theta)}{\partial \theta}\, f(x; \theta)\, dx.  (2.6.3)

The second term of the right hand side above is the Fisher Information and can be written as

I(\theta) = \int_{\Omega} \left(\frac{\partial \ln f(x; \theta)}{\partial \theta}\right)^2 f(x; \theta)\, dx = E\left[\left(\frac{\partial \ln f(x; \theta)}{\partial \theta}\right)^2\right].  (2.6.4)

74 From (2.6.3), we also have I(θ) = Ω 2 ln f(x; θ) f(x; θ) dx = E θ 2 ( 2 ln f(x; θ) θ 2 ). (2.6.5) It is also useful to note that since ( ) [ ( ) ] 2 ln f(x; θ) ln f(x; θ) V ar = E θ θ [ E ( )] 2 ln f(x; θ), θ from (2.6.2) we obtain ( ) ln f(x; θ) I(θ) = Var. (2.6.6) θ In the context of IRT, and in particular when dealing with one dichotomous item, j, and the Item Response Function P j (θ), the Fisher Information becomes [23] [52] I j (θ) = P j(θ) 2 P j (θ)(1 P j (θ)). (2.6.7) To see why, recall that for a dichotomous item, the possible values of Y are 0 and 1. For simplicity, ignore the subscript j and note that the likelihood function is L(Y θ) = P (θ) Y (1 P (θ)) (1 Y ), (2.6.8) and ln L(Y θ) = Y ln P (θ) + (1 Y ) ln(1 P (θ)). 58

75 Thus, and ln L(Y θ) θ = Y P (θ) P (θ) (1 Y ) (1 P (θ)) P (θ), 2 ln L(Y θ) θ 2 = Y ( P (θ) P (θ) P ) ( (θ) 2 P (θ) (1 Y ) P (θ) 2 (1 P (θ)) + P ) (θ) 2. (1 P (θ)) 2 ( ) 2 ln L(Y θ) Using I(θ) = E and the fact that E(Y ) = P (θ), we have θ 2 I(θ) = E [ Y ( = E[Y ] ( = P (θ) ( P (θ) P (θ) P ) ( (θ) 2 P (θ) (1 Y ) P (θ) 2 (1 P (θ)) + P )] (θ) 2 (1 P (θ)) 2 ( P (θ) P (θ) P ) ( (θ) 2 P (θ) E[(1 Y )] P (θ) 2 (1 P (θ)) + P )) (θ) 2 (1 P (θ)) 2 ( P (θ) P (θ) P ) ( (θ) 2 P (θ) (1 P (θ)) P (θ) 2 (1 P (θ)) + P )) (θ) 2 (1 P (θ)) 2 = (P (θ) P (θ) 2 ( = P (θ) 2 P (θ)(1 P (θ)) P (θ) P (θ) P ) (θ) 2 (1 P (θ)) ) P (θ) 2 = P (θ)(1 P (θ)). For the 3PL model, [23] states that for item j, I j (θ) = D 2 a 2 j(1 c j ) [c j + exp(da j (θ b j ))][1 + exp( Da j (θ b j ))] 2 (2.6.9) The best way to confirm this fact is to write the 3PL model as exp(da(θ b)) P (θ) = c + (1 c) 1 + exp(da(θ b)) = c + (1 c) 1 + exp( Da(θ b)), 59

76 and omit the subscript i for the moment. Then Q(θ) = 1 P (θ) = 1 c + (1 c) 1 + exp( Da(θ b)) (1 c) exp( Da(θ b)) =, 1 + exp( Da(θ b)) and P (θ)q(θ) = (1 c)(c exp( 2Da(θ b)) + exp( Da(θ b))) (1 + exp( Da(θ b))) 2. Also, We have P (θ) = (1 c)da exp( Da(θ b)) (1 + exp( Da(θ b))) 2. [P (θ)] 2 P (θ)q(θ) = (1 c)2 D 2 a 2 exp( 2Da(θ b)) (1 + exp( Da(θ b))) 2 (1 + exp( Da(θ b))) 4 (1 c)(c exp( 2Da(θ b)) + exp( Da(θ b))) = (1 c)d 2 a 2 (1 + exp( Da(θ b))) 2 (c + exp(da(θ b))), which rearranges to form (2.6.9). Several observations regarding an item s information can be made from (2.6.9). In general, information increases when an item is more discriminating, i.e., as a increases. Information will be higher when b is near θ, that is when the difficulty is near the ability level, and when guessing is minimal, i.e., as c goes to zero [23]. For a 3PL item, [23] states I(θ) max = b + 1 [ Da ln 0.5(1 + ] 1 + 8c), (2.6.10) thus the maximum amount of information occurs slightly to the right of θ = b. 60
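Both (2.6.9) and (2.6.10) are easy to check numerically. The sketch below evaluates the 3PL information function on a grid and compares the location of its maximum with the closed-form expression; the item parameters are illustrative values of the same general size as the items discussed next, not results from the exams analyzed in this study.

```python
import numpy as np

D = 1.7

def info_3pl(theta, a, b, c):
    """Item information for the 3PL model, equation (2.6.9)."""
    num = D**2 * a**2 * (1 - c)
    den = (c + np.exp(D * a * (theta - b))) * (1 + np.exp(-D * a * (theta - b)))**2
    return num / den

a, b, c = 0.88, 0.49, 0.23           # illustrative 3PL parameters

theta = np.linspace(-4, 4, 2001)
info = info_3pl(theta, a, b, c)
print("numerical argmax:", theta[np.argmax(info)])

# Closed-form location of maximum information, equation (2.6.10).
theta_max = b + (1.0 / (D * a)) * np.log(0.5 * (1 + np.sqrt(1 + 8 * c)))
print("closed form:     ", theta_max)
```

As the equations suggest, the two printed values agree (up to the grid spacing), and the maximum sits slightly to the right of θ = b whenever c > 0.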

77 For example, consider the following two 3PL items: 3PL Item A. Find and simplify the difference quotient f(x + h) f(x), h 0 for the given function. h f(x) = x 2 + 5x + 4 A. 2x + h + 4 B. 2x + h + 5 C. 2x2 + 2x + 2xh + h 2 + h + 8 h D. 1 Answer: B ID: [5] 3PL Item B. Find the horizontal asymptote, if any, of the graph of the rational function. A. y = 3 4 B. y = 3 f(x) = 3x + 7 4x 2 C. y = 7 2 D. no horizontal asymptote Answer: A ID: [5] For Item A, the item parameters are a = 0.88, b = 0.49, and c = 0.23 and for Item B, these values are a = 0.46, b = 1.76, and c = Item A is a highly discriminating item provides more information, as can be seen in Figure 2.6. A 2PL item provides more information than a 3PL item of similar discrimination and difficulty. The 2PL item presented below has item parameters a = 0.89 and b =

Figure 2.6. The Effect of the a Parameter on Item Information
Figure 2.7. 2PL and 3PL Items with Similar a Parameters

which are similar to the 3PL Item A above. The graphs in Figure 2.7 show the increased information that the 2PL item provides.

2PL Item. Write an equation of the parabola that has the same shape as the graph of f(x) = 2x^2, but with the point (2, −4) as the vertex. f(x) =
Answer: 2(x − 2)^2 − 4  ID: [5]
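The claim that a 2PL item carries more information than a 3PL item of similar discrimination and difficulty can be verified directly. The sketch below uses an algebraically equivalent form of (2.6.9) (it reduces to D²a²PQ when c = 0) and compares an item with and without a guessing parameter; the parameter values are illustrative.

```python
import numpy as np

D = 1.7

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c):
    # Equivalent form of equation (2.6.9); with c = 0 it becomes the 2PL value D^2 a^2 P Q.
    p = p_3pl(theta, a, b, c)
    return D**2 * a**2 * ((p - c) / (1 - c))**2 * (1 - p) / p

theta = np.linspace(-4, 4, 801)
a, b = 0.88, 0.49                       # roughly the discrimination and difficulty used above
i_2pl = info_3pl(theta, a, b, 0.0)      # c = 0: the 2PL version
i_3pl = info_3pl(theta, a, b, 0.23)     # c = 0.23: the 3PL version

print(i_2pl.max(), i_3pl.max())
print(np.all(i_2pl >= i_3pl))           # True: the 2PL curve is never below the 3PL curve
```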

79 Item information functions also exist for polytomously scored items. For example, consider (2.2.12), the Generalized Partial Credit Model or GPCM: p(y jk θ, a j, b jk ) = ( k ) exp h=0 Da j(θ b jh ) ( mj i=0 exp i ). h=0 Da j(θ b jh ) Recall a j is the discrimination parameter, b jh is the difficulty parameter between the categories h 1 and h, and k represents the response category. For ease of notation, write p(y jk θ, a j, b jk ) as P jk (θ). Then [10] shows in detail that the information function for item j is ( m m ) 2 I j (θ) = D 2 a 2 j k 2 P jk (θ) kp jk (θ). (2.6.11) k=0 k=0 For an example, consider the following item: GPC Example. Solve the following exponential equation by taking the natural logarithm on both sides. Express the solution in terms of natural logarithms. Then, use a calculator to obtain a decimal approximation for the solution. 3e 8x = 1449 What is the solution in terms of natural logarithms? The solution set is { }. What is the decimal approximation for the solution? The solution set is { }. Answer: ln 483, 0.77 ID: [5] 8 There are three possible responses to this item, {0, 1, 2}, and the item parameters are a = 0.60, 63

Figure 2.8. Information for a GPC and a 2PL Item

b_{0,1} = 0.49 and b_{1,2} = . The maximum amount of information occurs at θ = 0.55, and Figure 2.8 shows that this item provides more information than the 2PL item. Knowing the amount of item information and where that information is maximized is useful to test designers who may want to increase the number of items with information at some critical range, say the pass/fail score, and discard or replace items with little useful information. In addition to item information, in IRT there is the concept of the test information. For a test with n = 1, 2, 3, \dots, N items, the test information function is given by

I(\theta) = \sum_{n=1}^{N} I_n(\theta).  (2.6.12)

The test information is simply the sum of the individual item information functions, which follows from the fact that expected value is a linear operator: the expectation of a sum is the sum of the expectations. The test information function provides the test designer with an idea of where the maximum amount of information of the test is, as a whole. Figure

81 shows the test information for a college algebra exam where the maximum of 11.9 occurs at θ = 1.0 Figure 2.9. Test Information Since the test information is the sum of the item information, test designers can add and/or remove items to achieve a maximum amount of information at the desired ability level. In Item Response Theory, the researcher knows how much each item contributes to test information. Test information plays another role. The amount of information provided by a test at θ is inversely related to the precision with which ability is estimated at that point: SE(ˆθ) = 1 I(θ) (2.6.13) where SE(ˆθ) is called the standard error of estimation [23]. Equation (2.6.13) is the IRT version of the CTT standard error of measurement and has the following properties: It 65

82 decreases as the number and quality of items increase (in terms of higher discrimination and lower guessing parameters) and when the distribution of the difficulty parameters is close to the ability distribution of the population [23]. Test information also leads to the concept of the relative efficiency of a test, which allows tests to be compared [23]. Given two tests, A and B, aimed at measuring the same ability and defined over the same ability scale, the relative efficiency is defined as RE = I A(θ) I B (θ). (2.6.14) Suppose a test designer is interested in shortening a test. An example similar to one in [23] is illustrative. Suppose Test B is the test currently in use, and suppose it has been accepted as adequately measuring the ability in question. Let Test A be a test defined on the same ability scale aimed at measuring the same ability. If I A (θ) = 25.0 and I B (θ) = 20.0 then RE(θ) = The interpretation is that Test A functions as if it is 25% longer than Test B, or alternatively, one could shorten Test A by 20% without affecting the precision of measurement. If one knows in advance where the maximum amount of information is needed for a test and what the shape the test information curve should assume, items can be added and removed until the desired result is achieved The Ability Scale In the academic setting, the goal of a test is to provide a meaningful measure of a student s mastery of concepts and skills. This measure is typically a grade based on the number-correct score or the instructor s interpretation of the student s performance. It is 66

83 important to be able to relate the ability score obtained with IRT to a familiar scale. A score based on the number of correct responses to a mathematics test is easy to obtain. However, this number-correct score is not necessarily a reliable estimate. For example, a examinee of low ability may guess enough correct answers on a multiple choice test and receive a passing grade. Ideally, this should not happen. An advantage of IRT is that the models take into account the guessing factor and the difficulty level of the individual items. However, an estimate of an examinee s ability using IRT is not so straightforward. A correct model must be chosen, and the model must fit the data. When there is good model-data fit and the results match up with the researchers instincts about the test, the ability score obtained via IRT is a more reliable one. The following development is from [23]. The goal is to transform θ to a score more meaningful to the instructor. Call this score τ, the true-score. Let y j represent the response to item j, assume y j {0, 1}, and suppose there are J items on a test. If X denotes the number right score, then X = J y j. (2.7.1) j=1 Let ( J ) τ = E (X) = E y j = j=1 J y j E (y j ). (2.7.2) j=1 The probability that y j = 1 is given by the particular IRT model in use and is denoted P j (θ), and since there are only two possible outcomes, the probability that y j = 0 is 1 P j (θ) = Q j. 67

84 Figure Test Characteristic Curve Thus, E (y j ) = 1 P j (θ) + 0 Q j (θ) = P j (θ). It follows that τ = J y j E (y j ) = j=1 J P j (θ), (2.7.3) j=1 provided the model fits the data. This calculation also yields what is known as the test characteristic curve, which is the sum of all the item characteristic curves, the ICC s. For an example, see Figure 2.10 which displays a test characteristic curve for a college algebra exam. The domain score, π, given by π = 1 J J P j (θ), (2.7.4) j=1 is usually more desirable, since this represents proportion correct score [23]. Note that π [0, 1], unless the 3PL model is being used, in which case π [ 1 J cj, 1] because the 68

lower asymptote for each item j is c_j. The domain score can be calculated at points over the range of θ to yield the desired transformation. Table 2.1 provides an example of this calculation.

Table 2.1. Domain Score (columns: θ, τ, π)
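A calculation of the kind reported in Table 2.1 can be sketched as follows for a handful of hypothetical 3PL items: the true score τ is the sum of the item response functions, equation (2.7.3), and the domain score π is τ divided by the number of items, equation (2.7.4). The items here are illustrative, not those of the College Algebra exams analyzed in this study.

```python
import numpy as np

D = 1.7

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

# Hypothetical 3PL items; with nonzero c the domain score cannot fall below mean(c).
items = [(0.9, -1.0, 0.2), (1.1, 0.0, 0.25), (0.8, 0.7, 0.2), (1.3, 1.4, 0.2)]

theta = np.linspace(-3, 3, 13)
tau = sum(p_3pl(theta, a, b, c) for a, b, c in items)    # true score, eq. (2.7.3)
pi = tau / len(items)                                    # domain score, eq. (2.7.4)

for t, ts, ds in zip(theta, tau, pi):
    print("theta = %5.2f   tau = %5.2f   pi = %4.2f" % (t, ts, ds))
```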

86 2.8. Model-Data Fit The area of model-data fit in item response theory is one of active research, and there are several approaches to verifying the fit of a chosen model or models. Almost always, one begins by attempting to verify the assumptions of unidimensionality and local independence. Then, once parameter estimates are obtained, it is important to show that the correct IRT model was chosen and that this model fit the data. Goodness of fit statistics, residual analyses, and a comparison of the relative fit of different models all can be used to establish model-data fit to a desirable degree. Latent variables are assumed to account for the correlation of test items. All models make the assumption of conditional or local independence that the probability of a correct response to an item is independent when the latent variable or variables have been conditioned on. When it is assumed that there is only one latent variable, verifying independence is closely related to verifying unidimensionality. The methods of Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA) prove useful here. Alternatively, if one can show that the model fits the data, that information provides further evidence that these underlying assumptions hold Unidimensionality and Local Independence. The following discussion of the methods of EFA and CFA is based on [6]. Underlying both EFA and CFA is the common factor model which is attributed to L.L. Thurstone [49]. The basic premise of factor analysis, in this case, is that the observed responses to items on a test are correlated because of some common latent ability. If this factor can be conditioned on, then, ideally, these observations would no longer be correlated, and therefore, the responses would be independent of one 70

87 another, i.e., conditionally independent. Given a set of items and the observed responses to those items, the first step is to form the inter-item correlation or covariance matrix. Factor analysis treats the variance of each item as consisting of two parts: common variance and unique variance. The common variance results from the latent factors. The unique variance is composed of the random error variance and the variance specific to that item. Exploratory factor analysis is typically performed first to aid in identifying the number of factors underlying the common variance. For a simplified example, based on the one found in [6, p. 15], consider a four item test taken by 300 examinees, and suppose the inter-item correlation matrix obtained is R =, (2.8.1) where, for example, the entry r 32 = 0.66 represents the correlation between Item 3 and Item 2. The test is assumed to be unidimensional, which means there is one factor, call it η 1, that accounts for the correlations between items. (The correlation or the covariance matrix can be used the default for many computer programs is to use the correlation matrix.) Ideally, each item is affected by this factor to some degree. We say that the factor loads on each item and denote these factor loadings by λ mk, where m refers to the factor, and k is the item number. Here, m = 1. EFA produces the best possible matrix, Λ, of these factor loadings that account for the observed inter-item correlations. 71

88 In general, suppose we have the observed responses by n examinees to a test with k items, and let η m represent the mth common factor in this case. Letting y j represent the n observed responses to item j, the fundamental equation is given by y j = λ j1 η 1 + λ j2 η λ jm η m + ɛ j, (2.8.2) where λ jm is the mth factor loading which relates item j to η m [6]. There are several assumptions. For i = 1, 2,, m, we have E(η i ) = 0 and var(η i ) = 1, with cov(η i, η j ) = 0 when i j. Similar assumptions hold for ɛ i and additionally, var(ɛ i ) = θ i (this is the unique variance). Also, for all i, j, it is assumed that cov(ɛ i, η j ) = 0. In the simplified example from above, there are four equations: y 1 = λ 11 η 1 + ɛ 1 y 2 = λ 21 η 1 + ɛ 2 (2.8.3) y 3 = λ 31 η 1 + ɛ 3 y 4 = λ 41 η 1 + ɛ 4, which can be written as y = Λ y η + ɛ. (2.8.4) Letting R represent the correlation matrix of the 4 items, we can obtain R = Λ y ΨΛ y + Θ ɛ. (2.8.5) 72

89 In the general case with k items (or indicators) and m latent factors, R is a k k symmetric correlation matrix, Λ y is the k m matrix of factor loadings λ km, Ψ is the m m symmetric matrix of factor correlations, and Θ ɛ is a k k diagonal matrix of unique variances, ɛ k [6]. (Note that [6] uses Σ for the correlation matrix, but in most textbooks, see [43] for example, Σ is used for the covariance matrix, and R is used for the correlation matrix.) Equation (2.8.5) is best understood by looking at the covariance matrix and noting that Σ = cov(λη + ɛ) = cov(λη) + covɛ = Λcov(η)Λ + Θ ɛ. (2.8.6) = ΛΨΛ + Θ ɛ. = ΛΛ + Θ ɛ. Exploratory factor analysis can be performed with the premise that there is one underlying latent factor. Then the results can be analyzed to see if this assumption is reasonable. For the above example, R = Λ y ΨΛ y + Θ ɛ is given by [ ] [ ] = (2.8.7) 73
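The matrices in (2.8.7) did not survive transcription, but the check they illustrate, comparing the model-implied correlation matrix ΛΨΛ' + Θ with the observed one, can be sketched with hypothetical numbers. The first two loadings below echo the values (.828 and .841) used in the worked check that follows; the rest of the numbers are assumptions.

```python
import numpy as np

# Hypothetical one-factor solution for four items (Psi = [1] in the standardized model).
lam = np.array([0.828, 0.841, 0.80, 0.75])      # factor loadings, Lambda
theta_u = 1.0 - lam**2                          # unique variances, Theta_eps

R_implied = np.outer(lam, lam) + np.diag(theta_u)   # Lambda Psi Lambda' + Theta

# A hypothetical "observed" inter-item correlation matrix to compare against.
R_obs = np.array([[1.00, 0.70, 0.66, 0.62],
                  [0.70, 1.00, 0.66, 0.63],
                  [0.66, 0.66, 1.00, 0.60],
                  [0.62, 0.63, 0.60, 1.00]])

print(R_implied[0, 1])        # 0.828 * 0.841 = 0.696, the reproduced correlation
print(R_obs - R_implied)      # residual matrix, inspected as a check on model fit
```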

90 One way to assess the above result is to see how well the loadings, λ j1, for j = 1, 2, 3, 4 reproduce the correlation matrix. For example, COR(y 1, y 2 ) = λ 11 Ψ 11 λ 21 = (.828)(1)(.841) =.696, (2.8.8) and the actual correlation is.70. The difference between these two variables is a residual, and most computer programs will produce the residual matrix for inspection as well as various goodness of fit indices (which are discussed later in this section). Determining the number of underlying factors is closely related to the eigenvalues of the covariance matrix. Each eigenvalue describes how much a particular factor contributes to the total variance. To see why, consider the following development based on [43]. Let S represent the sample covariance matrix that is, the inter-item covariance matrix based on the observed responses to k items. Then S is a symmetric positive definite matrix and has a spectral decomposition, or eigen-decomposition, such that the eigenvectors are normalized. The fundamental equation given by (2.8.6) is approximated with an estimator, ˆΛ, such that S = ˆΛˆΛ + ˆΘ. (2.8.9) Ignore ˆΘ for the moment, and factor S into ˆΛˆΛ using spectral decomposition so that S = CDC, (2.8.10) 74

where C is a matrix of normalized eigenvectors, and

D = \begin{pmatrix} \theta_1 & 0 & \cdots & 0 \\ 0 & \theta_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \theta_k \end{pmatrix}  (2.8.11)

is a diagonal matrix of the corresponding eigenvalues. Then, since S is positive definite, all the eigenvalues are positive, and so D can be factored into D^{1/2} D^{1/2}. Then

S = CDC' = (CD^{1/2})(CD^{1/2})'.  (2.8.12)

Thus, we set \hat{\Lambda} = CD^{1/2}. For example, suppose there are 4 items, i.e., k = 4. Then \hat{\Lambda} = CD^{1/2} would be

\begin{pmatrix} \hat{\Lambda}_{11} & \cdots & \hat{\Lambda}_{14} \\ \vdots & & \vdots \\ \hat{\Lambda}_{41} & \cdots & \hat{\Lambda}_{44} \end{pmatrix} = \begin{pmatrix} c_{11} & \cdots & c_{14} \\ \vdots & & \vdots \\ c_{41} & \cdots & c_{44} \end{pmatrix} \begin{pmatrix} \sqrt{\theta_1} & 0 & 0 & 0 \\ 0 & \sqrt{\theta_2} & 0 & 0 \\ 0 & 0 & \sqrt{\theta_3} & 0 \\ 0 & 0 & 0 & \sqrt{\theta_4} \end{pmatrix} = \begin{pmatrix} \sqrt{\theta_1}\, c_{11} & \sqrt{\theta_2}\, c_{12} & \sqrt{\theta_3}\, c_{13} & \sqrt{\theta_4}\, c_{14} \\ \sqrt{\theta_1}\, c_{21} & \sqrt{\theta_2}\, c_{22} & \sqrt{\theta_3}\, c_{23} & \sqrt{\theta_4}\, c_{24} \\ \sqrt{\theta_1}\, c_{31} & \sqrt{\theta_2}\, c_{32} & \sqrt{\theta_3}\, c_{33} & \sqrt{\theta_4}\, c_{34} \\ \sqrt{\theta_1}\, c_{41} & \sqrt{\theta_2}\, c_{42} & \sqrt{\theta_3}\, c_{43} & \sqrt{\theta_4}\, c_{44} \end{pmatrix}.  (2.8.13)
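A numerical version of (2.8.9) through (2.8.13) takes only a few lines with NumPy's symmetric eigendecomposition. The correlation matrix below is a hypothetical stand-in for S, since the one in the example was lost in transcription; the last line reports the proportion of total variance attributable to each factor, as in the discussion that follows.

```python
import numpy as np

# Hypothetical 4 x 4 inter-item correlation matrix standing in for S.
S = np.array([[1.00, 0.70, 0.65, 0.62],
              [0.70, 1.00, 0.66, 0.63],
              [0.65, 0.66, 1.00, 0.60],
              [0.62, 0.63, 0.60, 1.00]])

evals, C = np.linalg.eigh(S)            # eigenvalues (ascending) and normalized eigenvectors
order = np.argsort(evals)[::-1]         # sort factors from largest to smallest eigenvalue
evals, C = evals[order], C[:, order]

Lam = C @ np.diag(np.sqrt(evals))       # Lambda-hat = C D^(1/2), eq. (2.8.12)
print(np.allclose(Lam @ Lam.T, S))      # True: the full decomposition reproduces S exactly

print(evals / evals.sum())              # proportion of total variance per factor
```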

92 Note that the sum of the squares of the 3rd column of ˆΛ is 4 4 ( ) 2 ˆλ 2 i3 = θ3 c i3 i=1 i=1 4 = θ 3 c 2 i3 = θ 3, i=1 (2.8.14) since these are normalized eigenvectors with length 1. The total variance of Item 3 consists of two parts: the common variance due to underlying factors and the unique variance, which is unique to the item and is represented by the entry s 33 in the sample covariance matrix S. The common variance is given by ˆλ ˆλ ˆλ ˆλ 2 34, or the sum of the squares of the 3rd row of ˆΛ. Note that the second factor contributes ˆλ 32 to the common variance of Item 3. Thus, the second factor contributes ˆλ ˆλ ˆλ ˆλ 2 42, or the sum of the squares of the second column of ˆΛ, which we have just shown is equal to θ 2, the second eigenvalue. Therefore, the eigenvalues describe how much each factor contributes to the total sample variance, which is given by tr(s). So, in general, the proportion of the variance arising from factor j is given by ˆλ 2 1j + ˆλ 2 2j + + ˆλ 2 kj tr(s) = θ j tr(s). (2.8.15) In factor analysis, the aim would be to reduce the size of the matrix CD 1 2, but in exploratory factor analysis, this is not necessary. The goal is to calculate the eigenvalues and sort them according to size with the hope that, in the case of a unidimensional test, there is one eigenvalue which is significantly larger than the others, or one that accounts for a predetermined proportion of the total variance. Most computer software programs use the correlation matrix rather than the covariance matrix, but all of the above still applies. When 76

93 the raw data is categorical or ordinal (for example, the responses to polytomous items on a test), polychoric correlations are used when producing the inter-item correlation matrix. There are software packages in R that do these calculations, and sometimes this specification is unnecessary because the software detects the type of data automatically. Several guidelines exist for choosing the correct number of factors, m, based on the eigenvalues of the correlation matrix. The decision can be based upon the percentage of variance the researcher wants the eigenvalues to account for. Another option is to choose m equal to the number of eigenvalues greater than one (the Kaiser-Guttman rule), where number one represents the average value of the total correlation, since the correlation matrix has ones on the diagonal. However, this approach often overestimates the number of factors whereas a scree plot is a popular method in IRT research that is readily available, frequently used, and usually provides satisfactory results [13, p. 292]. A scree plot is a graph with eigenvalues on the vertical axis and components or factors along the horizontal axis, that is visually inspected to determine where the slope of the line segments between eigenvalues changes substantially. For example, the scree plot in Figure 2.11, which was produced using the R package stats() [42] and a twenty-five item college algebra exam, provides evidence of a unidimensional instrument despite the fact that there were five eigenvalues greater than one. Exploratory Factor Analysis is primarily used to establish unidimensionality and provides a basis for further analysis using CFA. The fa() command of the R package psych [44] provides goodness of fit indices that aid in determining whether or not a one factor solution 77

94 Figure Scree Plot is meaningful. These indices are similar to those reported for confirmatory factor analysis and are presented following the discussion of CFA. Confirmatory factor analysis, which is also based on the common factor model, can be used to establish local or conditional independence [39]. In CFA, not only is the number of factors accounting for the covariance among the items predetermined, but all aspects of the model are specified. However, since we are considering a one factor model, on the surface, the two approaches seem quite similar. When conducting CFA, it is presumed that there is enough prior knowledge that the researcher can completely specify the model. For example, in the case of a one dimensional 50 item mathematics test, the model could be specified as 78

MATH → IT1, λ_1        IT1 ↔ IT1, ε_1
MATH → IT2, λ_2        IT2 ↔ IT2, ε_2
  ⋮                      ⋮                         (2.8.16)
MATH → IT50, λ_50      IT50 ↔ IT50, ε_50
                       MATH ↔ MATH, 1.0

This model specifies that one factor, MATH, accounts for all the covariance among the items. The one sided arrows represent factor loadings on the items. The two sided arrows represent the unique variances of the items, or in the case of the factor MATH, this arrow shows that its variance is fixed at one. For each item, i = 1, 2, …, 50, the parameter λ_i is the factor loading and represents the covariance of that item due to the latent factor, MATH. Each item, IT_i, has its own unique variance, ε_i. The variance of the factor MATH is set to 1.0 in this model, making this a standardized model specification. There are 100 unknowns or parameters to be estimated: 50 factor loadings and 50 unique variances. The inter-item covariance matrix, S, has 50(51)/2 variances and covariances; these are the knowns in the model specification (the diagonal plus below diagonal elements of the matrix). In general, if there are k items, there will be k(k + 1)/2 known covariances or variances and 2k unknown parameters to be estimated. Parameter estimation can be done in different ways, but maximum likelihood techniques are often preferred because of the fit indices and standard error estimates the method can produce [6].
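A specification like (2.8.16) can be written very compactly in R. The sketch below is illustrative only: it uses the lavaan package rather than the sem package discussed below, and the data frame name responses and item names IT1 through IT50 are assumptions, not the author's actual script.

# Illustrative one-factor CFA for a 50-item test, with the factor variance fixed to 1.
library(lavaan)
model <- paste("MATH =~", paste(paste0("IT", 1:50), collapse = " + "))
fit <- cfa(model, data = responses, std.lv = TRUE)   # std.lv = TRUE fixes Var(MATH) = 1
fitMeasures(fit, c("chisq", "df", "pvalue", "rmsea", "srmr", "cfi", "tli"))
mi <- modindices(fit)
head(mi[order(-mi$mi), ], 5)                         # five largest modification indices

With ordinal item responses one would also pass the item names through lavaan's ordered argument so that polychoric correlations are used.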

With S as the observed covariance matrix and Σ as the covariance matrix predicted by the model, the so-called fitting function is given by
F_{ML} = \ln|S| − \ln|\Sigma| + \mathrm{trace}\left[(S)(\Sigma^{-1})\right] − k.   (2.8.17)
This function is minimized in the maximum likelihood procedure [6]. Note that if the model fit perfectly, the two determinants would be equal, i.e., |S| = |Σ|, making (S)(Σ^{-1}) the identity matrix I with trace[(S)(Σ^{-1})] = k. Thus, F_{ML} would be zero. In short, the goal of the maximum likelihood procedure is to minimize the difference between the two matrices. The determinants are used because they can be viewed as providing a generalized measure of variance for the entire set of variables contained in the matrix [6, p. 73]. The trace is the sum of the variances.

It is possible to specify and analyze a model using the R package sem: Structural Equation Models [16]. This software yields parameter estimates (provided the method converges) along with the resulting matrix Σ, the matrix of residuals, and several fit indices. Almost every discussion of fit indices contains a warning or disclaimer due to the controversies over which indices are preferred and how they should be used; for example, see [29]. The indices described next, along with specific recommendations, are from [6] and fall into three categories: indices of absolute fit, indices reflecting model parsimony, and indices of comparative fit.

The typical χ² goodness of fit index is almost always reported but rarely used as the only measure of model fit, especially when sample sizes are large, as it becomes inflated. It is a measure of absolute fit, with the null hypothesis being S = Σ, a rather stringent

hypothesis [6, p. 81]. The number of degrees of freedom, df, is calculated by taking the number of known variances and covariances, k(k + 1)/2, and subtracting the number of freely estimated parameters. In the example above, there would be 50·51/2 − 100 = 1175 degrees of freedom. This index nearly always rejects the model when large samples are used [29, p. 54]. Another absolute fit index is the standardized root mean square residual (SRMR), which is a measure of the difference between the correlations of S and Σ. It is calculated by summing the squares of all the elements of the residual matrix, dividing this sum by the number of elements in the matrix, and finally taking the square root of this average. It takes on values between zero and one, with zero indicating perfect fit. In practice, recommendations are that SRMR < .06 or SRMR < .05 [6] [29].

Fit indices that favor a simpler model fall under the category of parsimony correction [6]. These indices favor a model with fewer freely estimated parameters, or equivalently, with more degrees of freedom. The root mean square error of approximation (RMSEA) is one such index. Although it is sensitive to the model's number of freely estimated parameters, it is not overly sensitive to sample size. The RMSEA assesses the extent to which a model fits relatively well in the population [6, p. 83]. The RMSEA is calculated based on a noncentral χ² distribution that includes a noncentrality parameter (NCP) which expresses the degree of model misspecification [6, p. 83]. Letting
d = (χ² − df)/(N − 1),   (2.8.18)

where N denotes the number of examinees, this index is calculated as
RMSEA = \sqrt{d/df}.   (2.8.19)
Some authors include RMSEA in the absolute fit category; for example, see [29]. Recommendations vary, suggesting RMSEA should be less than .07 to less than .06, with less than 0.03 representing excellent fit [29].

Comparative fit indices compare the fit of the specified or target model with a baseline model in which all inter-item covariances are set to zero. An example is Bentler's CFI [3]. Letting the subscript T represent the target model and B the baseline model, this index is calculated as
CFI = 1 − \frac{\max[(χ²_T − df_T),\, 0]}{\max[(χ²_T − df_T),\, (χ²_B − df_B),\, 0]},   (2.8.20)
and has a range of between 0 and 1, with a recommended value of CFI > .95 [6]. The Tucker-Lewis index (TLI), or the non-normed fit index (NNFI), is another comparative fit index [50]. Using the same notation as above,
TLI = \frac{(χ²_B/df_B) − (χ²_T/df_T)}{(χ²_B/df_B) − 1}.   (2.8.21)
Since the TLI is non-normed, it may take on values outside of [0, 1], but the recommendations are similar to those for the CFI, with TLI > 0.95 representing good fit; however, (r)ecommendations as low as 0.80 as a cutoff have been proffered [29, p. 55].

Advice varies on which indices to report, but it seems reasonable to report the following: χ² goodness of fit along with df and p value, RMSEA along with its confidence interval,

SRMR, CFI, and TLI. These indices have been chosen over other indices as they have been found to be the most insensitive to sample size, model misspecification, and parameter estimates [29, p. 56]. Note that the exception in this list is the χ², which is very sensitive to sample size, but it is traditionally reported [6]. Another approach to assessing fit is Hu and Bentler's two-index presentation strategy [29], which offers three different groupings of two indices that can be used together to establish good fit; see Table 2.2. For example, one could use the RMSEA and SRMR together as evidence of good fit, even when the CFI and TLI are lower than desired.

Table 2.2. Two-Index Presentation Strategy, Hu and Bentler (1999)
Fit Indices        Guidelines
TLI and SRMR       TLI > 0.96 and SRMR < 0.09
RMSEA and SRMR     RMSEA < 0.06 and SRMR < 0.09
CFI and SRMR       CFI > 0.96 and SRMR < 0.09

Modification indices also provide information regarding the fit of a specified model. These indices can be viewed as indicating possible missing paths in the model; e.g., two items may be so similar that their correlation is due more to their similarity than to the underlying latent factor. In such a case, the items are acting like additional latent factors. The modification index gives an estimate of how much the χ² statistic would decrease should one specify a new path [6], and a large modification index indicates an item that could be removed to improve fit [29].

The results from CFA are often far from perfect. However, the above fit indices are meant to serve as a guide to model fit and should not be the sole consideration when making the decision about the dimensionality of an instrument. The researcher should have enough

prior knowledge to specify the model, and the results should make sense in terms of this prior knowledge [6]. Once IRT analysis has been conducted and parameter estimates have been obtained, there are model-data fit analyses that can provide additional evidence supporting the assumptions of unidimensionality and local independence.

2.8.2. Chi-square Goodness of Fit Indices. Once appropriate models have been chosen and item and person parameters have been obtained, there are several chi-square goodness of fit indices which can be used to examine fit at the item level. Examples of these tests are the χ² goodness of fit test [19], the Q1 statistic due to Yen (1981), and the S−χ² statistic due to Orlando and Thissen [40]. The pros and cons of these indices can be found in [47]. All of these types of statistical tests are sensitive to examinee sample size. When the sample is large (> 2000), one sees an increase in Type I error rates. In contrast, when the sample size is small (< 500), these tests have low statistical power, and misfitting items may not be identified. In addition, these tests do not provide a graphical display of misfit, which can be illuminating [47]. Such tests should be considered, though, and often when misfit occurs, one can examine the data for evidence of a reasonable explanation (e.g., misfit may occur only at the very high or very low end of the ability scale, which may not be of concern to the researcher).

All of these indices are based on residuals. Hambleton et al. state that, "Perhaps the most valuable goodness-of-fit data of all are provided by residuals" [23, p. 66]. A residual represents the difference between the actual item response and the expected item response for a particular subgroup of examinees. For the Q1 statistic, these subgroups are based on ability level and obtained by dividing the ability scale θ into k = 10 to 15 bins in such a way

that there are approximately the same number of examinees in each bin. For Xcalibre's χ² statistic, the range of θ is divided into k, presumably equal, intervals or bins, but the exact details are not provided [19, p. 62]. Let O_{ik} represent the observed proportion of examinees in bin k who answered item i correctly. Let E_{ik} represent the expected proportion of correct responses. Note that the calculation of E_{ik} varies by fit statistic, but the residual is always r_{ik} = O_{ik} − E_{ik}.

The χ² goodness of fit statistic reported by the software program Xcalibre is given by
χ² = \sum_{i=1}^{kq} \frac{(O_i − E_i)^2}{E_i},   (2.8.22)
where k is the number of bins and q represents the number of response categories. The expected frequency is E_i = N P_i, where N represents the number of examinees in bin k and P_i is the probability of response q for examinees in bin k [19]. The number of degrees of freedom is determined by the number of bins, k, the number of categories, q, and the number of parameters to be estimated. For example, for a 3PL item and 15 bins, there are typically 15 bins × 2 response categories − 3 parameters = 27 degrees of freedom. The degrees of freedom will vary by item, however, since a minimum value for E_i is generally set. When there are at least 500 examinees, the minimum value is 5, and bins are collapsed to meet this criterion [19].

Yen's Q1 statistic is a modified chi-square statistic based on the sum of the squares of the standardized residuals, which for a dichotomous item i is given by
z_{ik} = \frac{O_{ik} − E_{ik}}{\sqrt{E_{ik}(1 − E_{ik})/N_k}},   (2.8.23)
where N_k represents the number of examinees in bin k. For this statistic, E_{ik} is calculated using the chosen model at a θ value based on the mean ability in bin k or the average calculated probability of a correct response for each examinee in bin k. The Q1 statistic can be used for both dichotomous and polytomous items. However, for polytomous items, the statistic is calculated for each response category separately. To obtain this statistic, the item responses are sorted according to the ability estimates, θ̂. Then 10 to 15 bins are created so that there are approximately the same number of examinees in each bin, and the observed and expected proportions correct are calculated. The observed proportion is simply a count of the number of examinees who responded correctly (or gave the category response), and the predicted proportion is calculated using the chosen model and the average θ value in the bin. Then, the square of the standardized residual is calculated for each bin and summed to yield the desired statistic. If k is the number of bins, and m is the number of parameters used in the particular model, then Q1 ∼ χ² with k − m degrees of freedom [23].

Yen's Q1 and the Xcalibre χ² fit statistics differ from Orlando and Thissen's S−χ² primarily in the method of creating the bins and in the calculation of the expected proportions of examinees within each bin. With the Q1 and χ² statistics, the examinees are divided into groups based on the ability estimates obtained from the IRT analysis. With the S−χ²

statistic, the examinees are divided into groups based upon their actual, i.e., observed test scores. The following description is from [31] and [40]. Letting I represent the perfect test score, the statistic for a dichotomous item i is calculated as
S_i−χ² = \sum_{k=1}^{I−1} N_k \frac{(O_{ik} − E_{ik})^2}{E_{ik}(1 − E_{ik})},   (2.8.24)
where N_k represents the number of examinees with score k, and O_{ik} is the observed proportion of examinees with score k. These two values are calculated directly from the observed data. The category corresponding to a score of zero is omitted because the proportion of examinees who obtain a score of zero and respond to an item correctly is zero. The category for a perfect score is also omitted because this proportion would always be one. The expected proportion of examinees in score group k who answer item i correctly is given by
E_{ik} = \frac{\int P_i(θ) f_i(k−1 \mid θ) φ(θ)\, dθ}{\int f(k \mid θ) φ(θ)\, dθ},   (2.8.25)
where P_i(θ) is the probability of answering item i correctly given the ability level θ and is calculated based on the model used and item and ability parameters obtained [31]. The terms f(k | θ) and f_i(k | θ) are the conditional predicted test score distributions with and without item i, respectively. These distributions are calculated using a recursive algorithm described in the IRTFIT manual [4, p. 45] that was developed by Lord and Wingersky [33] and expanded upon by [48]. This fit statistic has been generalized and can be used with dichotomous or polytomous items, although its usefulness in real testing situations is still under investigation [31].
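The Lord and Wingersky recursion referenced above is simple to implement for dichotomous items. The sketch below is illustrative only (it is not the IRTFIT macro, which also handles polytomous items); the vector p of model-implied item probabilities at one fixed θ is a hypothetical input. The leave-one-out distribution f_i(k | θ) is obtained by simply dropping item i from p.

# Conditional summed-score distribution f(k | theta) for dichotomous items via
# the Lord-Wingersky recursion, for a single value of theta.
score_dist <- function(p) {
  f <- c(1 - p[1], p[1])                            # distribution after the first item
  for (i in seq_along(p)[-1]) {
    f <- c(f * (1 - p[i]), 0) + c(0, f * p[i])      # add item i: incorrect vs. correct
  }
  f
}
# Example: three items with P_i(theta) = 0.8, 0.6, 0.4 at some theta
round(score_dist(c(0.8, 0.6, 0.4)), 4)              # probabilities of scores 0, 1, 2, 3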

Figure 2.12. Misfitting Item: Q1 Statistic

In particular, when a polytomous item has several categories with few responses in some of these categories, the behavior of this fit statistic may change. Fit statistics are useful for identifying items that require closer examination; an item should not be discarded based solely on one of these fit statistics. Graphical analysis of observed responses and model predictions may reveal that the misfit lies in areas of little interest to the researcher, at the extreme low end of the ability scale, for example. Another consideration is how well the item fits in its range of maximum information. Figure 2.12 shows the observed and predicted proportions correct for an item along with its information. This item would be flagged as misfitting by the Q1 statistic, but it does fit well in the region where the information is maximized. Misfit in polytomous items is often due to one or two categories with minimal observed responses, but inspection of the graphs may reveal that fit is good in other categories, and the item may actually function well.

2.8.3. Other Considerations. There are other methods available to assess model-data fit. In all cases, one begins by assessing the validity of the assumptions of the particular

105 model chosen. For example, one should check that the 3PL is appropriate for the multiple choice items, the 2PL fits the free response items, and that the GPCM is the best choice for the partial credit items. This can be done by comparing the relative fit of the different models. One must also consider whether examinees had adequate time to complete the test, that is, whether or not the test administration was speeded [23]. For the purpose of this research, this consideration is not an issue, since only those examinees with the lowest abilities failed to complete the test within the time allowed. When necessary, one can compare the variance of the number of unanswered items with that of the number of incorrect items. Ideally, this ratio is close to zero. Also of interest is the percentage of examinees who complete the test. Although it was not possible for the test used in this study, another way to determine whether time constraints are a factor is to compare results with timed and untimed examinees. Test fatigue is an additional factor to be considered, and it probably plays a role with a final exam consisting of fifty items. Unfortunately, in this study, the ordering of items (the last 13 of the 50 items are from new, previously untested material), made test fatigue a difficult factor to gauge. Item parameter invariance also provides a measure of model-data fit. After obtaining the results of item parameter and ability estimates with a particular model, one can divide the examinees into four, equally sized separate groups. First, create two groups of examinees randomly, and secondly, divide the examinees according to ability one group of high ability and one group of low ability. Then, using the same model, obtain parameter and ability estimates. Plot the baseline information of the difficulty parameter using the randomly 89

selected subgroups. There should be a high degree of correlation, and the plot of the two groups formed based on ability should be similar, with a similar degree of correlation. If this is not the case, then this is evidence that the wrong model is being used [23]. Similarly, items can be divided into groups randomly and based on difficulty. Similar plots are created, and one hopes to see similar results for the randomly chosen and difficulty based items. It can also be informative to plot the standardized residuals of individual items against the ability scale and compare the results obtained using different models.

2.9. Overview of Parameter Estimation Software: ICL and Xcalibre

There are commercial and freely available software programs available for IRT analysis, and a good listing maintained by the Institute for Objective Measurement can be found at [30]. The choice of software depends on the test being analyzed and the type of parameter estimation required by the researcher, for example, maximum likelihood estimation versus a Bayesian approach. For this study, a free software program was needed capable of analyzing a mixed format test with two types of dichotomous items and polytomous items with varying category numbers. Three different IRT models (2PL, 3PL, and GPC) were required, narrowing the choices. In addition, Bayesian parameter estimation techniques were required. A well known commercial software program with similar capabilities was also needed for the evaluation and comparison of results. The free software Item Response Theory Command Language (ICL) and the commercial software Xcalibre were chosen. Both of these programs can provide Bayes modal parameter estimates

107 and EAP ability estimates. The remainder of this section describes each of the programs in terms of their default settings, estimation techniques, and required data preparation. ICL is the creation of Bradley A. Hanson ( ) [27] and was chosen for this analysis because it is free, easy to use, quick to converge, and because it allows for a Bayesian approach to parameter estimation: EAP estimates of ability parameters and Bayes modal estimates of item parameters. In addition, Jurich and Goodman have shown in a limited study that ICL and PARSCALE (commercial software from Scientific Software International) obtain item and ability parameters comparably in a mixed format test (25% polytomous, 75% dichotomous items), using the GPC and 2PL models [18]. ICL commands are processed by an embedded Tool Command Language (Tcl) interpreter [26, p. 40]. No knowledge of Tcl is required to use the program, but all the commands are available, and users familiar with Tcl can redefine commands to suit their needs. The manual for ICL contains references for those interested in learning the language [26, p. 40]. All the details of the software are available for interested users. ICL provides parameter estimates for dichotomous items (with 1PL, 2PL, and 3PL models) and polytomous items (with the PC and GPC models) and ability estimates for examinees. Standard errors for item parameter estimates may be obtained using ICL s bootstrap command. In addition, ICL can simulate item responses to items with known parameters. For 2PL and 3PL dichotomous items, the models used by ICL are those described in Section 2.2, where c j = 0 for the 2PL items. The default setting in ICL is D = 1.7. For the polytomous items, the model used is similar to (2.1.12). The probability of scoring in 91

category k of item j is given by
p(y_{jk} \mid θ, a_j, b_{jk}) = \frac{\exp\left( \sum_{h=0}^{k} a_j(θ − b_{jh}) \right)}{\sum_{i=0}^{m_j} \exp\left( \sum_{h=0}^{i} a_j(θ − b_{jh}) \right)}.   (2.9.1)
Note that there is no scaling constant D = 1.7.

ICL item parameter estimation requires the raw data in integer format. A partial credit item with three parts has the possible response codes of 0, 1, 2, or 3 and would be considered a 4-category item. For a test with J items and N examinees, the data consists of an N × J matrix with J columns of item responses and N rows of response patterns, one for each examinee. One method of data preparation is to first modify a .csv Excel file of responses so that all responses are integer valued, then open it in a text editor, such as Notepad. Remove the commas, and save the file with a .dat extension. The command file for running the program is created with a text editor and saved with a .tcl extension. The command file used in this study can be found in Appendix A. Item types are designated in the command file, and the choice of model is indicated by an integer. All dichotomous items are coded as 1, and partial credit items are coded according to their number of categories. Another command line is required to designate the free response items as 2PL (since both multiple choice and free response items are dichotomous).

ICL has the following default settings: The range of the ability parameter, θ, is (−4, 4), the number of quadrature points used in the discrete latent variable distribution is 40, and the convergence criterion is the maximum relative change in item parameter estimates, over all items, which must be less than a set value. The scaling constant default is D = 1.7 (this can be changed by the user), but this does not affect the GPC model. The default priors

are four parameter beta priors:
a ∼ b(1.75, 3.0, 0.0, 3.0),  b ∼ b(1.01, 1.01, −6.0, 6.0),  c ∼ b(3.5, 4.0, 0.0, 0.5).   (2.9.2)
This distribution is characterized by 4 parameters: α, β, min and max, and is given by
f(x) = \frac{(x − \mathrm{min})^{α−1} (\mathrm{max} − x)^{β−1}}{B(α, β)(\mathrm{max} − \mathrm{min})^{α+β−1}},   (2.9.3)
where B(α, β) is the beta function. Both α and β must be positive and min < max. The domain of f is min ≤ x ≤ max. When α and β are both greater than one, the distribution is unimodal with the mode being
\mathrm{min} + \frac{α − 1}{α + β − 2}(\mathrm{max} − \mathrm{min}).   (2.9.4)
The mean is always
\mathrm{min} + \frac{α}{α + β}(\mathrm{max} − \mathrm{min}),   (2.9.5)
and the variance is
\frac{αβ}{(α + β)^2 (α + β + 1)}(\mathrm{max} − \mathrm{min})^2.   (2.9.6)
Thus, for the a prior, the mean is 1.11, the mode is 0.82, and the standard deviation is 0.60. These values for the b prior are, respectively, 0, 0, and 3.45, and for the c prior, respectively, 0.23, 0.23, and 0.086. In terms of percentiles (10th, 25th, 50th, 75th, 90th), for the a prior, these are 0.34, 0.62, 1.05, 1.53, and 1.96; for the c prior, the first four of these are 0.12, 0.17, 0.23, and 0.30. The

110 b prior is nearly uniform on the interval (-6, 6) [26]. Other options for priors include normal and lognormal distributions, plus the user may choose to use no prior distribution. Item parameter estimates are obtained via a modification of the EM algorithm, an efficient method of estimating unknown parameters based on observed data. The observed data, in this context, are the responses by the examinees to the items and the unknown parameters are the item parameters of the model and the ability estimates of the examinees. The details of this algorithm, as it is applied in ICL, can be found in Section 2.5.1, p. 45. For initial parameter estimates, ICL computes a rough estimate of examinee ability based on the observed data using the PROX procedure (Linacre, 1994) [26, p. 27]. These estimates are used to obtain starting values for dichotomous items via nonlinear regression. No initial estimates are obtained for the GPC model; instead default starting values are a = 1, and the b i parameters are set at zero. Ability parameter estimates are obtained with maximum likelihood techniques or via EAP estimation. Some details of the EAP estimation process are in Section 2.5.2, p. 54. There are some disadvantages to using ICL. No graphing utilities and no model-fit analyses are included as are found in most commercial software. The lack of graphical output means that users must use other software to produce ICCs, OCCs, and test and item information graphs. Model-data fit analyses can be performed using R or other graphing and statistical software packages. Also, while it is evident that ICL is being used in current research and is cited frequently as a freely available open source IRT program (for examples see [56] [14] [35]), there is little, if any, technical support available to users. However, the manual provides an adequate number of examples, and with some computer background, 94

111 anyone should be able to use this program. The use of this program does, however, require that the user is more knowledgeable of Item Response Theory in particular and of statistical methods in general. Xcalibre is available from Assessments Systems Corporation (and has replaced MULTI- LOG) and was chosen because of its capabilities and reputation [19]. A recent report using known item parameters concluded that Xcalibre recovers stable and accurate estimates for dichotomous items with sample sizes of 300 or more and for polytomous items with sample sizes of at least 500, with increased accuracy for both item types when sample sizes are larger. The a and b parameter estimates correlated highly with the known values (the correlation was strongest for b parameters). The c parameter had the weakest correlation explained in part by the low variance in this parameter [20]. Earlier studies compared Xcalibre to BILOG, LOGIST, and ASCAL using the 3PL model with the conclusion that Xcalibre provided the most accurate parameter estimates [57]. Xcalibre can provide EAP and maximum likelihood estimates of ability parameters and Bayes modal item parameter estimates using techniques similar to those of ICL for a mixed format test (dichotomous and polytomous items). The 1PL, 2PL, 3PL and Rasch models are options for dichotomous items, and five models are available for polytomous items. The output is a professional quality report complete with graphical analyses and model-fit statistics, along with classical item and test statistics. Xcalibre has additional capabilities: ability estimation via weighted maximum likelihood and evaluation of differential item functioning 95

112 (DIF). The graphical user interface is intuitive, and the user manual is comprehensive providing several examples, guidelines for interpreting output, and some technical details are given in the appendix for the interested user. Item response data (alphabetical or numerical) can be read directly from a comma or tab delimited file. A separate file, called the control file, is used to input the necessary item information, such as type, number of alternatives, etc. (See Appendix A for the control file used in this study.) While Xcalibre will allow for a mixture of dichotomous and polytomous items, only one model can be selected for each of these item types. However, assigning 2PL items to the GPC model (this model reduces to the 2PL model when the number of categories is two) allows for the analysis of a mixture of three item types. The 3PL model (2.1.2) can be used for multiple choice items, and the GPC model (2.1.12) can be used for partial credit items. The scaling constant can be set as D = 1 or D = 1.7 and applies to all models. (Unlike ICL, Xcalibre does include the scaling constant D in the GPC model.) Xcalibre is proprietary software, and the exact details of the parameter estimation algorithms are not available. The manual describes the basic steps of parameter estimation and refers the users to several publications containing more details of Xcalibre s application of the EM algorithm [19]. Both Xcalibre and ICL obtain Bayes modal item parameter estimates via the EM algorithm, but the implementation may not be identical. Initial item parameter estimates in Xcalibre are calculated from the classical proportion-correct and biserial-correlation statistics using transformations suggested by Jensema (1976) [19, p. 59]. The details of the procedure are outlined in the manual. The EAP ability parameter estimation techniques appear to be equivalent to ICL s. 96

Many of the default settings for Xcalibre differ from those of ICL. For item parameter estimation, the range of θ is (−7, 7), the number of quadrature points in the discrete latent variable distribution is 20, and the convergence criterion is 0.01. The default priors for Xcalibre are a ∼ Log-N(1.00, 0.30), b ∼ N(0.0, 1.0), and c ∼ N(0.25, 0.03). (Note: The documentation accompanying Xcalibre states that the prior for the guessing parameter is c ∼ N(0.20, 0.030); however, this was not the default when the program was used for this study [19, p. 61].) Alternate distributions are not available, but users can change the means and the standard deviations or use non-Bayesian estimation techniques without priors. Another option is using a prior mean equal to 1/(number of options) for the c parameters. Item parameters have default upper and lower bounds: a ∈ (0.05, 6), b ∈ (−4, 4), and c ∈ (0, 0.7), and the bounds of these parameter ranges represent the value at which a parameter will be fixed should its estimate become extreme. EAP ability estimation uses the same defaults as ICL: forty quadrature points are used for θ ∈ (−4, 4).
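Both programs' Bayes modal item estimates depend on these prior choices, so it is worth verifying the prior summaries quoted above. The means, modes, and standard deviations of ICL's default four-parameter beta priors follow directly from (2.9.4) through (2.9.6); the R sketch below (an illustration, not part of either program) reproduces them.

# Summaries of the ICL default four-parameter beta priors from (2.9.4)-(2.9.6).
beta4_summary <- function(alpha, beta, lo, hi) {
  width <- hi - lo
  c(mean = lo + alpha / (alpha + beta) * width,
    mode = lo + (alpha - 1) / (alpha + beta - 2) * width,
    sd   = sqrt(alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))) * width)
}
round(rbind(a = beta4_summary(1.75, 3.0,  0.0, 3.0),
            b = beta4_summary(1.01, 1.01, -6.0, 6.0),
            c = beta4_summary(3.5,  4.0,  0.0, 0.5)), 3)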

114 CHAPTER 3 METHODS The primary goal of this research was to determine if the freely available software, ICL, produces results comparable to the commercial software Xcalibre, using the data from a college algebra final exam. The Department of Mathematics at a large southern research university provided this data, excluding any examinee identification fields, as an Excel.csv file. First, the assumptions of unidimensionality and local independence were verified using techniques from exploratory and confirmatory factor analysis. Item and ability parameter estimates were obtained from each program, and the results from ICL were compared to those of Xcalibre, treating the Xcalibre estimates, in some sense, as known values. Modeldata fit analysis was conducted for both the ICL and the Xcalibre results, with a focus on items that functioned differently, e.g., an item that was flagged as misfitting for one program but not the other The Instrument: College Algebra Final Exam The instrument was a college algebra final exam consisting of thirty multiple choice, ten free response, and ten partial credit items taken by a total of 1412 students from twenty sections of the same course. This was a common exam, and it was administered online in a proctored computer lab environment over a period of six days (final exam week). The exam was offered at several pre-set time slots each day, and students chose in advance the time 98

115 that best fit their schedule. Once students began the exam, they were allowed a maximum of 2.5 hours for completion. The fifty items on the exam were selected from a test bank of questions, and each item had several iterations so that all students did not receive exactly the same exam. For example, two iterations of the first item were: 1. 10% of what number is 88? The number is. Answer: 880 ID:1.3.7 and 1. 50% of what number is 76? The number is. Answer: 152 ID:1.3.7 Most items appeared previously on one of the four semester tests or on a quiz, and students were familiar with the test environment, procedures, and the computer set up prior to taking their final exam Unidimensionality and Local Independence The unidimensionality of the college algebra exam was verified using the techniques of exploratory factor analysis available with the stats package of R and the command factanal() [42]. The raw data was a matrix of 50 columns, one item per column and one row per examinee. Each entry was a 0 or a 1 for a dichotomous item. For polytomous items, the values varied according to the number of categories present in each item. For example, Item 16 has 3 parts. Thus the possible responses are 0, 0.333,.0667, and 1. The command 99

116 factanal() provides eigenvalues of the correlation matrix that were used to produce a scree plot. To verify local independence, confirmatory factor analysis was conducted, specifying a one factor model, using the sem() package of R [16]. The resulting model fit statistics were analyzed according to the guidelines in Section 2.8.1, in particular, the Two-Index Presentation Strategy Hu and Bentler (1999) (see Table 2.2 on p. 83). Modification indices were examined to find any large indices indicative of items that should be removed Item and Ability Parameter Estimation The original data of 1412 examinee responses to the 50 items on the college algebra exam was provided as an Excel.cvs file where responses to dichotomous items were 0 or 1, and partial credit items had varying responses depending on the number of categories. For example, a partial credit item with four parts would have five possible response categories: 0, 0.25, 0.5, 0.75, and 1. For the ICL parameter estimation, the raw test data was modified so that all responses were integer valued; for example,the responses to a a partial credit item with 4 parts were changed to response codes of 0, 1, 2, 3 or 4 (a 5-category item). The.csv Excel file of modified responses was then opened in a text editor, commas were removed, and the file was saved with a.dat extension. The command file for running the program was created with a text editor and saved as a.tcl file, which is in Appendix A. The default settings of ICL were used, primarily because it was decided that these would be the settings the average user would employ. In particular, the range of the ability parameter, θ, was ( 4, 4), the number 100

of points in the discrete latent variable distribution was 40, and the convergence criterion was the ICL default (the maximum relative change in item parameter estimates, over all items, must be less than this criterion in order for the estimation process to terminate). The scaling constant default, D = 1.7, was used, but since this did not affect the GPC model, the a parameter estimates for partial credit items were adjusted by dividing by this factor after they were obtained. Graphs of the IRFs, test information, and item information were produced using R and Excel.

Standard error estimates for the item parameter estimates were obtained via a bootstrapping command found in ICL. Two hundred replications were deemed to be sufficient based on recommendations in [12], and the standard error was calculated using the equation that follows, also from [12]. For example, to calculate the standard error for the a parameter, let B = 200 represent the number of bootstrap samples selected. Let \hat{a}(b) be the a parameter estimate for each sample b = 1, 2, …, 200, and let \hat{a}(\cdot) = \sum_{b=1}^{B} \hat{a}(b)/B be the average of the 200 bootstrap a parameter estimates. Then, the standard error for the a parameter estimate is given by
se_a = \left\{ \sum_{b=1}^{B} \left[ \hat{a}(b) − \hat{a}(\cdot) \right]^2 / (B − 1) \right\}^{1/2}.   (3.3.1)
Similar calculations were done for each item parameter.

The raw test data was prepared similarly for Xcalibre estimation, but left as an Excel .csv file. The designation of item types was done in a separate file, called a control file, which is in Appendix A. Two-parameter items were assigned to the GPC model since this model reduces to the 2PL model when the number of categories is two. Multiple choice items were assigned to the 3PL model (2.1.2), and partial credit items were assigned to the GPC model

(2.1.12). The scaling constant D was set to 1.7 (and this applied to the GPC model as well). The following default settings for Xcalibre were changed to match those of ICL: The range of θ was changed from (−7, 7) to (−4, 4), the number of quadrature points in the discrete latent variable distribution was changed from 20 to 40, and the convergence criterion was changed from 0.01 to the ICL default. The default priors and the default upper and lower bounds for parameter estimates were used. Bayes modal item parameter estimates and EAP ability estimates were obtained from both programs.

3.4. Comparison of Parameter Estimates

Item parameter estimates were compared in several ways. Confidence intervals were created for the Xcalibre item parameter estimates based on the standard errors provided by the program. For example, for Item 3, a 95% confidence interval for the a parameter was calculated as (0.370, 0.677). This confidence interval was based on the t-distribution since both the actual mean and variance are unknown. However, with this sample size (N = 1412 examinees), the critical t_{α/2} value was equivalent to that of the normal distribution. Items found to have ICL parameter estimates outside the Xcalibre confidence intervals were further examined by comparing the IRFs from the two programs. Different parameter estimates can yield nearly identical IRFs (see Figure 4.2 in Section 2.2). Thus, some differences in parameter estimates may be of little concern when the IRFs differ in a range of θ where very few, if any, examinees are located. The IRFs were

compared in three ways: by visual inspection of the graphs, by the calculation of the root mean squared error (RMSE), and by the calculation of a weighted version of the root mean squared error (W-RMSE). Visual inspection of the IRFs provided an initial impression of the degree and location of their differences. The RMSE is a standard statistic that is reported in most IRT research and is useful in determining to what degree two IRFs differ. The RMSE for an item j was calculated using 56 evenly spaced points over the θ range of (−3.3, 2.3). Thus the RMSE for item j is given by
RMSE_j = \sqrt{ \frac{1}{56} \sum_{i=1}^{56} \left( P_{ICL}(θ_i) − P_{Xcal}(θ_i) \right)^2 },   (3.4.1)
where P_{ICL}(θ_i) and P_{Xcal}(θ_i) represent the probability of a correct response at θ_i, using the model chosen for item j and the item parameter estimates of ICL and Xcalibre respectively. The RMSE was calculated for all items, and graphs were produced comparing the resulting IRFs for both programs. In this context, the RMSE takes on values in [0, 1] with zero indicating identical curves. There is no standard guideline for what is an acceptable range for the RMSE, though it was decided that an RMSE < 0.03 was desirable. One disadvantage of the RMSE is that it does not factor in the location of the differences in the IRFs. For this reason, a weighted error statistic was calculated that generates an L² norm:
\left\{ \int_{\Omega} \left| P_{ICL}(θ) − P_{Xcal}(θ) \right|^2 \, dµ \right\}^{1/2},   (3.4.2)

where µ is the probability distribution of θ, and Ω is the support of θ. These values were calculated using the same range of θ, where for an item j,
W\text{-}RMSE_j = \left\{ \sum_{i=1}^{56} \left( P_{ICL}(θ_i) − P_{Xcal}(θ_i) \right)^2 µ(θ_i) \right\}^{1/2},   (3.4.3)
where µ(θ_i) is the probability of an examinee having ability θ_i based on the ability distribution obtained from the Xcalibre results. (The ability distributions from the two programs were nearly identical, with Xcalibre having a slightly larger range of θ values.) The range of W-RMSE was also [0, 1], with zero representing identical curves. A value of less than 0.05 was deemed desirable, and items with larger W-RMSE values were examined closely. For example, the item in Figure 3.1 had an RMSE of 0.032, but a W-RMSE of 0.014, reflecting the fact that the IRFs differ in a region of θ where few examinees are located.

Figure 3.1. RMSE and W-RMSE
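The two statistics in (3.4.1) and (3.4.3) are short calculations once the two IRFs have been evaluated on the θ grid. The sketch below is illustrative only: the 3PL parameter values are arbitrary, the normal weights stand in for the empirical Xcalibre ability distribution used in the study, and the 3PL form with D = 1.7 is the standard one referenced in Section 2.2.

# RMSE (3.4.1) and W-RMSE (3.4.3) between two IRFs for one dichotomous item.
p3pl <- function(theta, a, b, c, D = 1.7) c + (1 - c) / (1 + exp(-D * a * (theta - b)))
theta <- seq(-3.3, 2.3, length.out = 56)
p_icl  <- p3pl(theta, a = 0.80, b = 0.00, c = 0.25)   # illustrative "ICL" estimates
p_xcal <- p3pl(theta, a = 0.70, b = 0.20, c = 0.20)   # illustrative "Xcalibre" estimates
w <- dnorm(theta); w <- w / sum(w)                    # stand-in ability weights, summing to 1
rmse   <- sqrt(mean((p_icl - p_xcal)^2))              # equation (3.4.1)
w_rmse <- sqrt(sum(w * (p_icl - p_xcal)^2))           # equation (3.4.3)
round(c(RMSE = rmse, W_RMSE = w_rmse), 3)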

Histograms of the frequency distributions of the a, b, and c parameter estimates and their summary statistics were used to compare the distributions of the ICL parameter estimates with those of Xcalibre. Differences in test information were examined by plotting the test information function (TIF) for both programs along with the standard error of measurement (SEM). Note: In the Xcalibre manual, the SEM is called the conditional model-predicted standard error of measurement (CSEM) [19, p. 65]. The estimates of the ability parameters for each examinee were compared along with the resulting frequency distributions and summary statistics. The Test Response Functions and standard error graphs were also compared using similar techniques as described above (using visual inspection and calculating the RMSE of the curves from different programs).

3.5. Model-Data Fit

To assess model-data fit, two chi-square tests were calculated: Yen's Q1 statistic as described in [23] and [40] and Thissen's S−χ² statistic as described in [40] and [47]. Of interest were items that differed in terms of fit, for example, an item that was flagged as misfitting for one software program but not the other. These items were further examined graphically. In the case of the Q1 statistic, graphs were created showing the observed versus predicted responses which were used to calculate the statistic. Since this information was not available for the S−χ² statistic, the IRFs were compared graphically and quantitatively using the RMSE and W-RMSE. In both cases, parameter estimates were compared in relation to the item's type and content.

122 Yen s Q1 statistic is a modified chi square statistic based on the sum of the square of the standardized residuals (Section 2.8.2, p. 84): z ik = O ik E ik Eik [1 E ik ]/N k. The Q1 statistic can be used for both dichotomous and polytomous items; however, for polytomous items, the statistic is calculated for each response category separately. To obtain this statistic, the item responses were sorted according to the ability estimates of the examinees, ˆθ. Fourteen bins were created with an average of examinees in each, and the observed and expected proportions correct were calculated. The observed proportion was a count of the number examinees who responded correctly (or gave the category response), and the predicted proportion was calculated using the appropriate model and the average value of θ in the bin. Then the square of the standardized residual was calculated for each bin and summed to yield the desired statistic. If k is the number of bins, and m is the number of parameters used in the IRT model, then Q 1 χ 2 with k m degrees of freedom. Note that for both the Xcalibre and ICL Q 1 calculations, the three examinees with the lowest scores were omitted, because including them would have caused the average θ value for the first bin to be unusually low. These three examinees had a scores of 15, 20, and 21 (out of a total of 100 points) on an exam of fifty items. The S χ 2 statistic (see Section 2.8.2, p. 87) was calculated using a free SAS macro called IRTFIT [4]. 106
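The Q1 procedure just described can be sketched in a few lines of R. The function below is illustrative only and does not reproduce the study's scripts; the argument names (resp, theta_hat, prob_fun) are hypothetical, and in the actual analysis 14 bins were formed after dropping the three lowest-scoring examinees.

# Q1 for one dichotomous item: equal-count ability bins, observed vs. model-predicted
# proportion correct at the mean theta of each bin, summed squared standardized residuals.
q1_item <- function(resp, theta_hat, prob_fun, n_bins = 14) {
  bins <- cut(rank(theta_hat, ties.method = "first"),
              breaks = n_bins, labels = FALSE)        # approximately equal-count bins
  q1 <- 0
  for (k in seq_len(n_bins)) {
    idx <- bins == k
    N_k <- sum(idx)
    O_k <- mean(resp[idx])                            # observed proportion correct
    E_k <- prob_fun(mean(theta_hat[idx]))             # predicted proportion at mean theta
    q1  <- q1 + (O_k - E_k)^2 / (E_k * (1 - E_k) / N_k)
  }
  q1                                                  # compare to chi-square with (n_bins - m) df
}

Here prob_fun would be the chosen IRF with the item's estimated parameters plugged in, for example prob_fun = function(t) p3pl(t, a, b, c) using the 3PL sketch given earlier.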

123 CHAPTER 4 RESULTS The IRT software programs, Xcalibre and ICL, provided ability and item parameter estimates based on the observed responses from 1412 examinees to 49 items on a college algebra exam (one item was removed, see Section 4.1. There are numerous ways to analyze the results from IRT analysis, but the primary interest were results that highlighted any differences between the two programs. Neither software program was used to assess unidimensionality and local independence, but these results are reported as it was necessary to establish these assumptions to a reasonable degree before proceeding with parameter estimation. The results that highlighted differences in model-data fit, item parameter estimates (and their resulting IRFs), and ability parameter estimates are reported in the following sections, with more complete tables and graphs available in the appropriate appendices Unidimensionality and Local Independence Exploratory factor analysis of the final exam responses showed that the resulting correlation matrix had twelve eigenvalues greater than one: 8.03, 1.69, 1.45, 1.35, 1.29, 1.17, 1.15, 1.12, 1.09, 1.06, 1.05, However, the scree plot, see Figure 4.1, showed a marked drop off between the first and second eigenvalues and no other significant change afterwards. For confirmatory factor analysis, a one factor model was specified, and the fit and five highest modification indices are in Table 4.1. The two measures of absolute fit, SRMR and 107

124 Figure 4.1. Scree Plot: All Items RMSEA, indicated good model fit. However, the comparative fit indices suggested a possible problem with the model specification. For acceptable fit, both the Bentler and Tucker-Lewis indices should be greater than.90, and it is preferable to have the adjusted goodness of fit index greater than Table 4.1. Confirmatory Factor Analysis: All Items Fit Indices Modification Indices Index Value Path Value Adjusted goodness of fit index Item 6 Item RMSEA Item 2 Item SRMR Item 37 Item Bentler CFI Item 33 Item Tucker-Lewis NNFI Item 36 Item The modification indices in Table 4.1 revealed large values for Item 2 and Item 6, and it was decided that one of these items should be removed from the test to preserve local 108

125 independence. These items are very similar both ask for the width and length of a rectangular surface. Item 6 was removed because of its reference to a barn an object that may be unfamiliar to some examinees. These items are presented here for reference. 2. The length of a new rectangular playing field is 8 yards longer than quadruple the width. If the perimeter of the rectangular playing field is 566 yards, what are its dimensions? The width is yards. The length is yards. Answers: 55, 228 ID: [5] 6. The area of a rectangular wall of a barn is 65 square feet. Its length is 8 feet longer than the width. Find the length and width of the wall of the barn. The width is The length is feet. feet. Answers: 5, 13 ID: [5] After the removal of Item 6, exploratory and confirmatory analysis was repeated on the 49 item instrument. The scree plot shown in Figure 4.2 indicated one dominant latent factor, but there were still twelve eigenvalues greater than one: 7.84, 1.67, 1.41, 1.35, 1.26, 1.16, 1.13, 1.11, 1.08, 1.05, 1.04, Table 4.2 contains the fit and modification indices from the confirmatory factor analysis with Item 6 removed. There was little change in the fit indices, and, again, only the absolute fit indices indicated good model fit. However, using the fit guidelines in Table 2.2, p. 83, these indices, along with a clear cut scree plot, provided enough evidence of unidimensionality and 109

126 Figure 4.2. Scree Plot: Item 6 Removed local independence to proceed with IRT analysis. The model-data fit analyses of the results provided further evidence that these assumptions were met. Table 4.2. Confirmatory Factor Analysis: Item 6 Removed Fit Indices Modification Indices Index Value Path Value Adjusted goodness of fit index Item 37 Item RMSEA Item 33 Item SRMR Item 20 Item Bentler CFI Item 17 Item Tucker-Lewis NNFI Item 36 Item Model-Data Fit Xcalibre produced a report that included item and whole test fit statistics with none of the 49 items begin flagged as misfitting (see Table C.1). However, the two fit statistics, Q 1 110

127 and S χ 2, did identify items with misfit. A summary of the results is in Table C.2. In that table, Both means that misfit was detected for both the ICL and the Xcalibre estimates. The complete table of Q 1 and S χ 2 statistics can also be found in Appendix C. The items of interest were those that highlighted potential differences in the results across the two software programs. In terms of model-data fit, these were items identified as misfitting for one program but not the other. Eight of the forty-nine items had misfit for Xcalibre, but not ICL: Items 4, 14, 21, 23, 31, 41, 45 and 47. Items 16 and 40 showed misfit for ICL but not Xcalibre. The fit statistics for these ten items are in Table 4.3, where the area of different fit is in bold. These items are the focus of the following discussion. Table 4.3. Items of Interest Xcalibre ICL Xcalibre ICL Item df S χ 2 p df S χ 2 p df Q 1 p Q 1 p Five of these items showed misfit according to the S χ 2 statistic for Xcalibre only: Items 14, 21, 31, 41 and 45. All of these items fit for both programs according to the Q 1 statistic. These items are presented with discussion, beginning with Item 14, shown below. 111

128 14. Begin by graphing the standard cubic function f(x) = x 3. Then use transformations of this graph to graph the function given below. h(x) = 3x 3 Choose the correct graph of h(x) below. Answer: D ID: [5] Note that some examinees should have been able to eliminate two distractors easily based on the orientation of the graph, but since there were a range of versions of this item with the coefficient of x varying as 6, 5, 3, 2, 1 6, 1 2, or 1, choosing the correct graph with 3 rational coefficients was probably more difficult. Thus, guessing may not have been as easy as the above version of this item implies. From the graphs in Figure 4.3, it is evident that this is a poor item in terms of discrimination, but the IRFs from the two programs were Figure 4.3. Item

129 nearly identical with very low RMSE and W-RMSE values. The parameter estimates were for Xcalibre: a = 0.39, b = 0.76, c = 0.26, and for ICL these were: a = 0.35, b = 1.0, c = The lower c parameter for ICL suggests that guessing was less successful, and ICL interpreted this as a less difficult item. (See Section 2.2, p. 18, for details on interpreting item parameters.) Item 21 is a five option multiple choice item: 21. Use the leading coefficient test to determine the end behavior of the graph of the given polynomial function. f(x) = 4x 7 + 2x 6 + 6x A. Falls left & rises right. B. Falls left & falls right. C. Rises left & rises right. D. Rises left & falls right. E. None of the above. Answer: A ID: [5] Figure 4.4. Item 21 Figure 4.4 and the W-RMSE value of show that the IRFs were similar for both programs except in the lower range of ability. The parameter estimates for Xcalibre were a = 0.53, b = 113

130 0.73 and c = 0.25 and for ICL were a = 0.50, b = 0.91 and c = This item had good fit according to the Q 1 statistic for both programs. Item 31 is a six category partial credit item: 31. Graph the given function by making a table of coordinates. f(x) = 3 x Complete the table of coordinates. (Type integers or fractions. Simplify your answers.) x y Choose the correct graph below. Answers: 1 9, 1, 1, 3, 9; D ID: [5] 3 The number of responses in each category from zero to six were 6, 7, 24, 86, 59, 1227, and 87% of examinees answered this item correctly. The item was further complicated by the fact that one response category involved guessing with four distractors, two of which could be eliminated by some examinees. Because of the low number of responses in the first three categories, Figure 4.5 shows only the last three response categories. The IRFs were very similar, especially in the region of θ >

131 Figure 4.5. Item 31 Item 41 is a four option multiple choice item: The logistic growth function f(t) = describes the e 0.22t population of a species of butterflies T months after they are introduced to a non-threatening habitat. How many butterflies are expected in the habitat after 20 months? Round to the nearest whole number. A. 274 butterflies B. 8,000 butterflies C. 474 butterflies D. 374 butterflies Answer: D ID: [5] Except in the lower ability range, the IRFs from the two programs were similar as shown in Figure 4.6, but only the Xcalibre results did not fit according to the S χ 2 statistic. 115

132 Figure 4.6. Item 41 The c parameter estimates differed most: c = 0.24 for Xcalibre and c = 0.16 for ICL. This item was not on a previous test, and it could be that the distractors were quite good as implied by ICL (see Section 2.2, p. 18, for details regarding the interpretation of the c parameter). The RMSE for this item was 0.032, the third highest overall. Another complicating factor for this item is that it came from material at the end of the course, and it appeared at the end of a long exam. In addition, the key to being able to answer this item correctly is often a matter of mastering the calculator used in this course. Item 45 is a six option multiple choice question. The fit, according to Q 1 statistic, was nearly identical, but the p value for the S χ 2 statistic for Xcalibre was zero and for ICL, p = This item was from the last section of material covered in the course and had not appeared on a previous test. Some examinees probably did not bother to master this last section of material, while others may have memorized the patterns that are associated with each type of partial fraction decomposition. The graphs in Figure 4.7 are similar, but the parameter estimates were not. For ICL, these were a = 0.48, b = 0.38 and c = 0.16, but for 116

133 Xcalibre, these were a = 0.51, b = 0.12, and c = Note that with six good distractors, one would expect a c value close to that of ICL s (see Section 2.2, p. 18, for details regarding the interpretation of the c parameter). The item is presented below. 45. Write the form of the partial fraction decomposition of the rational expression. It is not necessary to solve for the constants. 7x 2 3x + 5 (x 9)(x 2 + 9) What is the form of the partial fraction decomposition of the given expression? A. A x 9 + Bx + C x Dx + E (x 2 + 9) 2 B. Ax + B x 9 C. + Cx + D x A x 9 + Bx (x 9) + Cx + D (x + 9) D. E. F. A x 9 + Bx (x 9) + C x A x 9 + B x A x 9 + Bx + C x Answer: F. ID: [5] Figure 4.7. Item

134 The IRFs were similar, except in the region θ < 1.5, but as the W-RMSE value indicated, this was a region with few examinees. Next consider, in order, the three items that fit the ICL but not the Xcalibre results according the to the Q 1 statistic: Items 4, 23, and 47. Item 4 is a four option multiple choice item: 4. Divide and express the result in standard form. 10i 3 i i i i i Answer: A ID: [5] Figure 4.8. Item 4 Figure 4.8 shows that the fit was quite similar from both programs, especially in the region of high examinee numbers, and the Q 1 statistics were nearly identical (see Table 4.3) with the first and last residuals accounting for much of the difference. ICL s parameter estimates were a = 0.78, b = 0.21, and c = 0.35 and for Xcalibre, a = 0.65, b = 0.47, and c = 0.27, reflecting a different interpretation of the item. However, the IRFs were similar except in 118

135 the low ability range with RMSE = and W-RMSE = (see Figure??, p.??). Item 23 is also multiple choice with four options: 23. Divide using long division. 1. 2x 2 63x x x x x x x x 2 12x 3 2x 3 75x + 18 x 6 Answer: B ID: [5] The parameter estimates for Item 23, and the IRFs (see Figure??, p.??) are almost identical: For ICL, a = 0.60, b = 1.32 and c = 0.24 and for Xcalibre, a = 0.60, b = 1.32, and c = 0.25, with an RMSE of and a W-RMSE of for the IRFs. The differences in the observed response at bins 5, 8, and 9 caused the Xcalibre Q 1 statistic to be high enough to be flagged as misfitting. Figure 4.9. Item

136 Item 47 is multiple choice with four options, and like Item 45, it was based on material covered at the end of the course: 47. Write the partial fraction decomposition of the rational expression. A. B. C. D. x 8 (x 3)(x 4) 4 x x 4 5 x x 4 4 x x 4 5 x x 4 Answer: B ID: [5] Figure Item 47 Figure 4.10 shows similar fit for both programs. Only 682 or 48.3% of examinees answered correctly, making it one of the more difficult items. The parameter estimates were nearly identical: a = 0.95, b = 0.75, c = 0.26 for Xcalibre and a = 1.1, b = 0.73, c = 0.27 for ICL, 120

137 resulting in almost identical IRFs with an RMSE of and a W-RMSE of The observed response at bins 7 and 14 caused the item to be flagged as misfitting for Xcalibre. The next two items fit the Xcalibre results, but not the ICL results according to the Q 1 statistic. Item 16 is partial credit with four categories, and the misfit for ICL was in response category two. 16. For f(x) = 5x and g(x) = x + 6, find the following: a. (f g)(x) = (Simplify your answer.) b. (g f)(x) = (Simplify your answer.) c. (f g)(2) = (Simplify your answer.) Answers: 5x + 30, 5x + 6, 40 ID: Figure Item 16: Response Category Two There were 114 examinees who answered in response category two, or approximately 8% of the examinees, and these low numbers made parameter estimation difficult. Even so, the parameter estimates from both programs were similar for all categories as were the IRFs. For ICL, a = 0.32, b 1 = 0.08, b 2 = 1.07, and b 3 = For Xcalibre, a = 0.31, b 1 = 0.01, b 2 = 1.15, and b 3 = The RMSE was 0.003, and the W-RMSE was.001 for 121

IRFs of response category two. It was primarily the observed response in bin 2 that caused this item to be flagged as misfitting for ICL. Item 40 was problematic.

40. The half-life of the radioactive element unobtanium-53 is 20 seconds. If 80 grams of unobtanium-53 are initially present, how many grams are present after 20 seconds? 40 seconds? 60 seconds? 80 seconds? 100 seconds? (Round to one decimal place.) Answers: 40, 20, 10, 5, 2.5 ID: [5]

Figure 4.12. Item 40: Response Category Zero

This is a six-category partial credit item, and the responses in categories 0 through 5 were 110, 19, 5, 0, 26, and 1252, with 89% of examinees answering correctly. Xcalibre automatically excludes any partial credit item with no responses in one or more categories. Thus, this

item was changed to prevent exclusion by Xcalibre by converting it to a four-category item as follows: Categories 1 and 2 were combined, as were categories 3 and 4. Even with this modification, some categories had low examinee numbers: there were 110, 24, 26, and 1252 examinees in response categories 0, 1, 2, and 3, respectively. The misfit for ICL was in response category zero only, shown in Figure 4.12, where two large residuals are evident at bins 1 and 3. The parameter estimates did not match well. For Xcalibre: a = 0.27, b1 = 2.32, b2 = 0.71, and b3 = 8.47; for ICL: a = 0.35, b1 = 1.53, b2 = 1.22, and b3 = 6.0. In addition, the IRFs for Item 40 had the highest RMSE and W-RMSE values of all 49 items, and in response category zero, the RMSE was and the W-RMSE was .

4.3. Item Parameter Estimates

Item parameter estimates were obtained from ICL and Xcalibre using the same test data. Two modifications were made to the original data: Item 6 was omitted for reasons outlined in Section 4.1, and Item 40 was changed from a six-category to a four-category GPC item for reasons outlined in the previous section. Tables of all parameter estimates for both programs are in Appendix D. Xcalibre provides item parameter standard errors that can be used to obtain confidence intervals for parameter estimates. All item parameter estimates and standard errors are reported in Tables D.2, D.3, D.4, and D.5 (in Appendix D), where the items are grouped by type. A summary of this information is presented below in Table 4.4.

Table 4.4. Item Parameter Estimates: Standard Errors. (Mean estimate and mean standard error for each parameter type, reported for ICL and for Xcalibre: a, b, and c for the 3PL items; a and b for the 2PL items; and a and b1 through b6 for the GPCM items.)

Different parameter estimates can yield nearly identical IRFs (ICCs and OCCs), as discussed in Chapter 2, Section 2.2. Since it is the IRF that characterizes the probability of a correct response, some differences in parameter estimates may be of little concern provided the IRFs are similar. In addition, these differences may cause the IRFs to differ only in a range of θ where very few, if any, examinees are located. For these reasons, two statistics were calculated to compare the IRFs: the RMSE and the W-RMSE. Both statistics are described in Chapter 3, Section 3.4. The W-RMSE is weighted via the probability distribution of examinee ability and yields a more meaningful statistic in terms of the difference between two IRFs.
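As a rough illustration of how the two comparison statistics behave, the sketch below computes an unweighted RMSE over a grid of θ values and a weighted version in which each grid point is weighted by an ability density. The exact weighting scheme is the one defined in Section 3.4; the standard normal weight and the parameter values used here are assumptions of the sketch, not the dissertation's formula or data.

import numpy as np

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def rmse_between_irfs(p1, p2):
    # Plain RMSE: average squared gap over the theta grid
    return np.sqrt(np.mean((p1 - p2) ** 2))

def weighted_rmse(p1, p2, theta):
    # Weighted RMSE: squared gaps weighted by an assumed standard normal
    # ability density, so sparsely populated regions of theta contribute little
    w = np.exp(-0.5 * theta ** 2)
    w = w / w.sum()
    return np.sqrt(np.sum(w * (p1 - p2) ** 2))

theta = np.linspace(-4, 4, 161)
# Hypothetical pair of estimates for the same item (illustration only)
p_prog1 = irf_3pl(theta, a=0.80, b=0.20, c=0.35)
p_prog2 = irf_3pl(theta, a=0.65, b=0.45, c=0.27)

print(f"RMSE   = {rmse_between_irfs(p_prog1, p_prog2):.3f}")
print(f"W-RMSE = {weighted_rmse(p_prog1, p_prog2, theta):.3f}")

With curves that separate only in the tails, the weighted value comes out noticeably smaller than the unweighted one, which is the pattern reported for most of the flagged items below.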

4.3.1. Comparison of Item Parameter Estimates. The item parameter estimates for each item type were compared directly by determining which ICL item parameter estimates fell outside the confidence intervals (CIs) created using the Xcalibre item parameter estimates and standard errors. First consider the thirty 3PL items, and note that the standard errors of the a parameter estimates were similar for both programs, but those of the b and c parameters were not: Xcalibre had a higher standard error for the c parameter, while ICL had a higher standard error for the b parameter estimates. Three items had ICL a parameter estimates that did not lie within the Xcalibre confidence intervals: Items 4, 26, and 47. Eleven items had ICL b parameter estimates that did not lie within the Xcalibre confidence intervals: Items 4, 9, 10, 14, 21, 28, 41, 42, 45, 46, and 49. This high number of items was due, in part, to the small standard errors for the Xcalibre b parameter estimates. All of the ICL c parameter estimates fell within the Xcalibre confidence intervals because of the large standard errors for the Xcalibre c parameters. The graphs in Figures 4.13 and 4.14 show the IRFs from both the Xcalibre and ICL parameter estimates for these 13 items (Item 4 had both a and b parameters outside of Xcalibre's confidence intervals). Despite the differences in parameter estimates, the IRFs were quite similar. The RMSE and W-RMSE, which provide a measure of the average distance between the curves, are reported for each pair of graphs. Note that for many items, the difference in the IRFs was limited to θ < -1.5, where only 5.7% of examinees fell. Further, only 2% of examinees fell in the region θ < -2.0, where the differences were most noticeable.
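A minimal sketch of the comparison rule just described: build a Wald-style interval from each Xcalibre estimate and its standard error, then flag the parameters whose ICL estimate falls outside it. The 95% level (z = 1.96) and the example numbers are assumptions for illustration; they are not taken from the dissertation's tables.

Z_95 = 1.96  # assumed 95% Wald interval

def outside_ci(icl_est, xcal_est, xcal_se, z=Z_95):
    # True if the ICL estimate falls outside Xcalibre's estimate +/- z * SE
    lower, upper = xcal_est - z * xcal_se, xcal_est + z * xcal_se
    return not (lower <= icl_est <= upper)

# Hypothetical b-parameter comparisons for two items
print(outside_ci(icl_est=0.78, xcal_est=0.65, xcal_se=0.05))  # True: outside the interval
print(outside_ci(icl_est=0.62, xcal_est=0.65, xcal_se=0.05))  # False: inside the interval

The same check makes it clear why a very small standard error (as with the Xcalibre b estimates) produces many flagged items even when the point estimates are close, while a large standard error (as with the Xcalibre c estimates) produces none.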

Figure 4.13. 3PL Items: Parameters Outside Xcalibre CIs (1 of 2)

Figure 4.14. 3PL Items: Parameters Outside Xcalibre CIs (2 of 2)

Items 10, 28, and 41 stand out with their relatively high values for the RMSE. (Note: Item 41 was discussed in the previous section; see p. 116.) Item 10 is multiple choice with four options:

10. Solve the absolute value inequality using an equivalent compound inequality. Other than ∅, use interval notation to express the solution set and graph the solution set on a number line. 7x < 7 (Four answer choices.) Answer: D ID: [5]

Xcalibre gave parameter estimates of a = 0.61, b = 0.96, and c = 0.24, while ICL gave a = 0.60, b = 1.14, and c = . Even though the ICL c parameter estimate fell within Xcalibre's confidence interval, the difference between the two c parameter estimates was the primary cause of the high RMSE value. This item was not flagged as misfitting for either program. Item 28 is a four-option multiple choice item. The parameter estimates for ICL were a = 0.68, b = 0.65, and c = . For Xcalibre, these were a = 0.55, b = 0.44, and c = . Fifty-eight percent of examinees responded correctly, and both programs interpreted it as one of the more difficult items, but with low discrimination. (See Section 2.2, p. 18, for details regarding item parameter interpretation.) Since it is likely many examinees could have eliminated two distractors based on vertical asymptotes, a higher value than 0.25 for

the c parameter is plausible. While this item fit for both programs according to the Q1 statistic, it did not fit for either program according to the S-χ2 statistic. The RMSE of the two ICCs was 0.025, one of the top eleven highest RMSE values, but the W-RMSE value of and the graphs in Figure 4.14 show that the misfit was concentrated at the high and low ranges of θ, where few examinees were located. The item is presented below.

28. Graph the rational function f(x) = (x^2 - x - 6) / (x^2 - 1). Answer: A ID: [5]

The parameters for the nine 2PL items are in Table D.3. All of ICL's a parameter estimates fell within the Xcalibre confidence intervals, and only one item, Item 33, had a b parameter estimate outside the confidence interval. For this item, the b parameter estimate for ICL was 1.198, and for Xcalibre, b = 1.324; Xcalibre's small standard error, 0.046, was the reason the ICL estimate was outside the interval. In terms of absolute percent error, the difference was 9.5%. This item had the second highest RMSE, as is evident in Figure 4.15, and the W-RMSE value of was the second highest value of all items. The item and the IRFs are presented below.

Figure 4.15. Item 33

33. Evaluate the following expression without using a calculator: log₂ √2 = . Answer: 1/2 ID: [5]

The average ability level for an examinee who missed this item was θ = -1.14, a failing score, and the average for those who answered it correctly was θ = 0.224, a passing score. Eighty-three percent of examinees answered this item correctly. Of the ten partial credit items (see Tables D.4 and D.5), two had ICL parameter estimates that fell outside the Xcalibre confidence intervals: Item 2 and Item 40. For Item 2, this occurred only for the b2 estimate. For Item 40, this was the case for all parameters: a, b1, b2, and b3. The graphs of the IRFs for both items and both programs are in Figure 4.16. The absolute percent difference for the b2 parameter of Item 2 was 1.7%. The reason the ICL estimate fell outside of the Xcalibre interval was the relatively small standard error for

the b2 parameter. The RMSE values for Item 2 categories 0, 1, and 2, respectively, are 0.017, 0.001, and 0.018, with W-RMSE values of 0.012, 0.001, and . There was very little difference in the curves for θ > -1.6, where 95% of examinees were located.

Figure 4.16. Item 2 and Item 40

Item 40 was a problematic item, as has been discussed previously. The two programs interpreted this item differently, but it is not a good item and should be discarded or modified further. Not surprisingly, this item had the largest RMSE of all 49 items. Specifically, for categories 0 through 3, the RMSE values were 0.131, 0.018, 0.012, and 0.156, with the first and last categories being the highest, as is clear from the graph. For reference, Item 40 can be found at the end of Section 4.2. Frequency histograms for the 3PL item parameters are in Figure 4.17. In particular, the c parameter histograms highlight the differences in the distribution of c parameter estimates. For Xcalibre, twenty-eight of the thirty 3PL items had a c parameter estimate between 0.24 and . The a and b parameters had similar distributions for both programs. The mean values, standard deviations, and RMSEs for all parameter types are in Table 4.5. For the 3PL items, the primary difference between the two programs was found in the b and c parameter estimates. The RMSE of the b1 parameter estimates was 0.276, the

highest of these values in Table 4.5. Item 40 contributed to this high RMSE value, and if this item were excluded from the calculation, the RMSE would have been . The partial credit items showed higher RMSE values for the b_i parameters in general, and this trend can be attributed in large part to Item 40, which affected the RMSE for b1 through b4. There were two seven-category items with b5 and b6 parameters, Items 22 and 31, and these parameter estimates were similar (see Table D.5). Note that, in Table 4.5, the correlation between the ICL and Xcalibre estimates was high for all parameter types, with the lowest value found in the c parameter.

Figure 4.17. Frequency Histograms of Parameter Estimates: 3PL Items
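The summary statistics collected in Table 4.5 (means, standard deviations, the RMSE between the two sets of estimates, and their correlation) can be reproduced from paired vectors of estimates. The sketch below shows the computation for one parameter type; the example numbers are made up for illustration and are not the dissertation's values.

import numpy as np

def compare_estimates(icl, xcal):
    # Summary statistics for one parameter type across items:
    # means, SDs, RMSE of the paired differences, and Pearson correlation
    icl, xcal = np.asarray(icl, float), np.asarray(xcal, float)
    rmse = np.sqrt(np.mean((icl - xcal) ** 2))
    corr = np.corrcoef(icl, xcal)[0, 1]
    return {"ICL mean": icl.mean(), "ICL SD": icl.std(ddof=1),
            "Xcalibre mean": xcal.mean(), "Xcalibre SD": xcal.std(ddof=1),
            "RMSE": rmse, "correlation": corr}

# Illustrative a-parameter estimates for five items (not the exam's data)
print(compare_estimates(icl=[0.80, 0.60, 1.10, 0.70, 0.50],
                        xcal=[0.70, 0.60, 1.00, 0.55, 0.50]))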

Table 4.5. Item Parameter Estimates: RMSE. (For each parameter type a, b, c, and b1 through b6: the mean and SD of the ICL estimates, the mean and SD of the Xcalibre estimates, the RMSE between the two sets of estimates, and their correlation. The b4 through b6 parameters are based on three or fewer items.)

4.3.2. Comparison of Item Response Functions. The graphs of the IRFs for all items are in Appendix E. The graphs were compared visually and quantitatively by calculating the RMSE and the W-RMSE for each pair of IRFs from the two programs (see Section 3.4 for details of these error measures). Table E.1 contains the values for all items. Eleven of the forty-nine items had an RMSE value greater than or equal to 0.025, and these values are shown in Table 4.6. This table also shows the W-RMSE values for these items; only Items 40 and 33 had a W-RMSE value greater than . Of the eleven items, Items 10, 28, 33, 40, and 41 were discussed in the previous Subsection 4.3.1, and Items 31, 40, and 41 were discussed in Section 4.2. Of the remaining items, Items 15, 32, and 35 are 3PL items whose IRFs are shown in Figure 4.18, and the IRFs of Items 17 and 22 (partial credit items) are in Figure 4.19. Note that the IRFs for the 3PL items differed primarily in their lower asymptotes, which corresponded to the lower c parameter estimates from ICL.

Table 4.6. Items with RMSE ≥ 0.025. (Item number, RMSE, and W-RMSE for the eleven items; values for all items are in Table E.1.)

The W-RMSE indicated that the differences in these ICCs were concentrated in an area of the ability range where very few examinees were located. Items 17 and 22 are partial credit items, both of which had RMSE values greater than 0.025 for the first response category only, and both had very low examinee responses in this category: 21 for Item 17 and 28 for Item 22 (out of 1412 examinees). The IRFs for this category are shown in Figure 4.19. Differences in parameter estimates were manifested in the item and test information functions. The 3PL items in particular play a role because a higher value of c translates into less information. Also, higher values of the a parameter yield more information. The test information and the standard error of measurement for both programs are shown in Figure 4.20.

Figure 4.18. 3PL Items with RMSE ≥ 0.025
Figure 4.19. GPC Items with RMSE ≥ 0.025

Figure 4.20. Test Information and SEM

Xcalibre's test information function had a maximum of at θ = 1.0, while ICL's maximum of occurred at θ = 1.0. In general, ICL had lower c parameter estimates and higher a parameter estimates, and both of these differences contributed to the higher maximum for ICL. Despite this difference in the maximum amount of information, the location of the maximum was identical. The standard error (CSEM or SEM) is at a minimum where the test information is maximized. Xcalibre had a minimum standard error of 0.31 at θ = 1.0, and ICL had a minimum of 0.29 at θ = 1.0. Note that there was little difference in the standard error for θ ∈ (-2, 2). The RMSE for the standard error curves in the range -3.3 ≤ θ ≤ 2.3 was .
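The relationship between the c and a estimates and the information curves can be made concrete with a small computation. For a 3PL item, the standard item information function is I(θ) = D²a²(P − c)²Q / ((1 − c)²P), with Q = 1 − P; test information is the sum over items, and the conditional SEM is 1/√I(θ). The sketch below assumes this standard formula, handles only 3PL items (the GPC items also contribute information on the real exam), and uses illustrative parameter values; it is not the output of either program.

import numpy as np

D = 1.7  # assumed scaling constant

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c):
    # Birnbaum's 3PL item information function
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * (p - c) ** 2 * (1 - p) / ((1 - c) ** 2 * p)

theta = np.linspace(-4, 4, 161)

# Illustrative item bank (a, b, c); not the exam's actual parameters
items = [(0.8, -0.5, 0.20), (1.1, 0.0, 0.25), (0.6, 1.0, 0.30), (0.9, -1.2, 0.15)]

test_info = sum(info_3pl(theta, a, b, c) for a, b, c in items)
csem = 1 / np.sqrt(test_info)  # conditional standard error of measurement

k = np.argmax(test_info)
print(f"max information {test_info[k]:.2f} at theta = {theta[k]:.2f}; "
      f"minimum SEM = {csem[k]:.2f}")

Raising any item's c toward the reciprocal of its number of options lowers its information curve, while raising a sharpens and heightens it, which is why ICL's lower c and higher a estimates produce the larger information maximum noted above.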

4.4. Ability Parameter Estimates

Ability estimates were obtained from Xcalibre and ICL using EAP estimation. Figure 4.21 shows the joint histogram of the ability estimates, and the frequency distribution is in Table 4.7. The ability estimates from the two programs were very similar. Table 4.8 shows the descriptive statistics of the estimates.

Figure 4.21. Ability Distributions: Joint Histogram

The examinees were sorted by their actual test scores, and the differences in the Xcalibre and ICL ability estimates were compared. The range of the θ estimates was (-3.29, 2.24), and the average absolute difference was . In general, Xcalibre gave slightly lower θ values for examinees of low ability and slightly higher values for those of high ability, and the largest differences in individual examinee ability estimates were found at the low and high ends of the ability scale.
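For dichotomous items, an EAP ability estimate is a posterior mean computed over a quadrature grid: the prior density at each grid point is multiplied by the likelihood of the examinee's response pattern, and the weighted average of the grid points is taken. The sketch below assumes a standard normal prior, 3PL items, and made-up parameters and responses; it mirrors the general EAP idea rather than either program's implementation (which must also handle the GPC items).

import numpy as np

D = 1.7  # assumed scaling constant

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def eap_estimate(responses, items, n_quad=61, prior_sd=1.0):
    # EAP (posterior mean) ability estimate on a quadrature grid,
    # assuming a normal prior and locally independent 3PL items
    grid = np.linspace(-4, 4, n_quad)
    prior = np.exp(-0.5 * (grid / prior_sd) ** 2)  # unnormalized N(0, prior_sd^2)
    like = np.ones_like(grid)
    for u, (a, b, c) in zip(responses, items):
        p = p_3pl(grid, a, b, c)
        like *= p if u == 1 else (1 - p)
    post = prior * like
    post /= post.sum()
    theta_hat = np.sum(grid * post)                        # posterior mean (EAP)
    psd = np.sqrt(np.sum(post * (grid - theta_hat) ** 2))  # posterior SD
    return theta_hat, psd

# Hypothetical 3PL items and one examinee's scored (0/1) responses
items = [(0.8, -0.5, 0.20), (1.1, 0.0, 0.25), (0.6, 1.0, 0.30), (0.9, -1.2, 0.15)]
print(eap_estimate([1, 1, 0, 1], items))

Because both programs condition on essentially the same item parameters and the same prior family, EAP estimates computed this way tend to agree closely in the center of the ability scale and diverge most in the sparse tails, consistent with the pattern described above.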
