Multidimensional item response theory observed score equating methods for mixed-format tests


University of Iowa, Iowa Research Online, Theses and Dissertations, Summer 2014

Multidimensional item response theory observed score equating methods for mixed-format tests

Jaime Leigh Peterson, University of Iowa. Copyright 2014 Jaime Peterson. This dissertation is available at Iowa Research Online.

Recommended Citation: Peterson, Jaime Leigh. "Multidimensional item response theory observed score equating methods for mixed-format tests." PhD (Doctor of Philosophy) thesis, University of Iowa, 2014. Part of the Educational Psychology Commons.

MULTIDIMENSIONAL ITEM RESPONSE THEORY OBSERVED SCORE EQUATING METHODS FOR MIXED-FORMAT TESTS

by Jaime Leigh Peterson

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) in the Graduate College of The University of Iowa

August 2014

Thesis Supervisor: Associate Professor Won-Chan Lee

Graduate College, The University of Iowa, Iowa City, Iowa

CERTIFICATE OF APPROVAL: PH.D. THESIS

This is to certify that the Ph.D. thesis of Jaime Leigh Peterson has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) at the August 2014 graduation.

Thesis Committee: Won-Chan Lee, Thesis Supervisor; Michael J. Kolen; Robert L. Brennan; Timothy N. Ansley; Aixin Tan; Tianli Li

To Obie and Mayah

ACKNOWLEDGMENTS

First, I would like to thank the College Board for allowing me to use the Advanced Placement data in my dissertation. Second, I would like to thank Dr. Lee for his guidance, patience, and encouragement as my thesis and academic advisor. You have enriched my psychometric training by immeasurable amounts and I am so grateful. Third, I am most certainly thankful to Dr. Kolen for his challenging yet approachable teaching style as my professor, CASMA supervisor, and thesis committee member. I am also very thankful for the thought-provoking suggestions provided by my entire thesis committee: Dr. Lee, Dr. Kolen, Dr. Brennan, Dr. Ansley, Dr. Tan, and Dr. Li.

I would also like to thank all of my classmates in the Measurement and Statistics program. You made my transition into the program seamless, and at times when it seemed like I was about to derail, your comradery put me back on track. I likely would never have entered the educational measurement field if it were not for the Statistics and Measurement faculty at the University of Wisconsin-Milwaukee, where I started my degree. Taking yet another step back, I must thank Dr. MacDougall from Eckerd College for instilling the confidence in me that I was capable of achieving a Ph.D. and was actually good with statistics.

Of course, I could not have pushed through the hardest times, those that made me question everything, without my family. Not to mention, their constantly evolving ways of botching the word Psychometrician provided me just the right amount of laughter at just the right times. Lastly, I wouldn't have finished my dissertation with nearly as much grace or composure if it were not for my fiancé, Nick. To him, I say Mahalo.

ABSTRACT

The purpose of this study was to build upon the existing MIRT equating literature by introducing a full multidimensional item response theory (MIRT) observed score equating method for mixed-format exams, because no such methods currently exist. At this time, the MIRT equating literature is limited to full MIRT observed score equating methods for multiple-choice-only exams and Bifactor observed score equating methods for mixed-format exams. Given the high frequency with which mixed-format exams are used and the accumulating evidence that some tests are not purely unidimensional, it was important to present a full MIRT equating method for mixed-format tests. The performance of the full MIRT observed score method was compared with the traditional equipercentile method, the unidimensional IRT (UIRT) observed score method, and the Bifactor observed score method. With the Bifactor methods, group-specific factors were defined according to item format or content subdomain. With the full MIRT methods, two- and four-dimensional models were included and correlations between latent abilities were freely estimated or set to zero. All equating procedures were carried out using three end-of-course exams: Chemistry, Spanish Language, and English Language and Composition. For all subjects, two separate datasets were created using pseudo-groups in order to have two separate equating criteria. The specific equating criteria that served as baselines for comparisons with all other methods were the theoretical Identity and the traditional equipercentile procedures.

Several important conclusions were made. In general, the multidimensional methods were found to perform better for datasets that evidenced more multidimensionality, whereas unidimensional methods worked better for unidimensional datasets. In addition, the scale on which scores are reported influenced the comparative conclusions made among the studied methods. For performance classifications, which are most important to examinees, there typically were not large discrepancies among the UIRT, Bifactor, and full MIRT methods. However, this study was limited by its sole reliance on real data, which was not very multidimensional and for which the true equating relationship was not known.

Therefore, plans for improvements, including the addition of a simulation study to introduce a variety of dimensional data structures, are also discussed.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS

CHAPTER I: INTRODUCTION
  Equating Designs
  Traditional Equating Methods
  Item Response Theory
  IRT Equating Methods
  Multidimensional Item Response Theory
  MIRT Equating Methods
  Research Objectives

CHAPTER II: LITERATURE REVIEW
  Item Response Theory
    Unidimensional IRT Models
    UIRT Item and Test Characteristics
  Multidimensional Item Response Theory
    Multidimensional IRT Models
    MIRT Item Characteristics
    MIRT Test Characteristics
    MIRT Item and Test Information
  Dimensionality Assessment Procedures
  UIRT Scale Linking Procedures
    Scale Linking Methods for Dichotomous IRT Models
    Scale Linking Methods for Polytomous IRT Models
  MIRT Scale Linking Procedures
    Translation of the Origin
    Unit of Measurement of the Coordinate Axes
    Rotation of the Coordinate Axes
    Random Groups Design
    Nonequivalent Groups Design
  UIRT Equating Methods
  MIRT Equating Methods

CHAPTER III: METHODOLOGY
  Data and Procedures
    Matched Samples Datasets
    Single Population Datasets
  Dimensionality Assessment
  Item Calibration and Model Data Fit
  Equating Procedures
    UIRT Observed Score Equating
    Bifactor Observed Score Equating
    Full MIRT Observed Score Equating
  Evaluation Criteria
    Root Mean Square Difference
    Differences That Matter
    AP Grade Agreement
  Summary

CHAPTER IV: RESULTS
  Sample and Test Characteristics
    Form Characteristics
    Test Blueprints
  Dimensionality Assessments
  Model Fit
    Fitted Distributions
  Traditional Equipercentile Method
  UIRT and MIRT Methods
    Equating Relationship for Single Population Datasets
      Chemistry
      Spanish Language
      English Language and Composition
    Equating Relationships for Matched Sample Datasets
      Chemistry
      Spanish Language
      English Language and Composition
  Weighted Root Mean Square Differences for Old Form Equivalents
    Single Population Datasets
    Matched Sample Datasets
  AP Grade Agreements
    Single Population Datasets
    Matched Sample Datasets
  Summary

CHAPTER V: DISCUSSION
  Importance of the Equating Criterion
  Effect of Violating the IRT Unidimensionality Assumption: Research Objective #5
  General Comparison of Equating Methods: Research Objective #2
    Single Population Datasets
    Matched Sample Datasets
    Summary
  Effect of Latent Trait Structure with Full MIRT Equating: Research Objective #3
    Single Population Datasets
    Matched Sample Datasets
    Summary
  Effect of Modeling Dimensionality According to Content or Item Format: Research Objective #4
    Single Population Datasets
    Matched Sample Datasets
    Summary
  Limitations and Future Considerations
  Conclusions

REFERENCES

APPENDIX: TABLES AND FIGURES

LIST OF TABLES

A1. Scores and Weights of the MC and FR Items by AP Main Form Exam. 136
A2. Scores and Weights of the MC and FR Items by AP Alternate Form Exam. 136
A3. Raw-to-scale score conversion table for AP Chemistry. 137
A4. Raw-to-scale score conversion table for AP Spanish Language. 141
A5. Raw-to-scale score conversion table for AP English Language and Composition. 145
A6. Calibration settings used during item parameter estimation. 148
A7. Studied Conditions for Matched Sample and Single Population Datasets. 149
A8. Effect Sizes for Group Differences in Common Item Scores. 150
A9. Descriptive Statistics for Forms Associated with the Matched Samples Design. 150
A10. Descriptive Statistics for Forms Associated with the Single Population Design. 151
A11. Test Blueprint for Chemistry. 151
A12. Test Blueprint for Spanish Language. 151
A13. Test Blueprint for English Language and Composition. 151
A14. Poly-DIMTEST Test Statistic Probabilities for Matched Sample Forms. 152
A15. Disattenuated Correlations for Matched Sample Forms. 152
A16. AIC Indices for EFA and CFA Fitted Models. 152
A17. BIC Indices for EFA and CFA Fitted Models. 153
A18. Presmoothing Parameters used for the Old and New Form Distributions. 153
A19. Weighted Root Mean Squared Differences for Raw Scores Using the Single Population Datasets and Identity Criterion. 153
A20. Weighted Root Mean Squared Differences for Scale Scores Using the Single Population Datasets and Identity Criterion. 154
A21. Weighted Root Mean Squared Differences for Raw Scores Using the Matched Samples Datasets and Traditional Equipercentile Criterion. 154
A22. Weighted Root Mean Squared Differences for Scale Scores Using the Matched Samples Datasets and Traditional Equipercentile Criterion. 155
A23. Overall Exact Percent Agreement of AP Grades for Single Population Datasets. 155
A24. Overall Exact Percent Agreement of AP Grades for Matched Samples Datasets. 156

LIST OF FIGURES

2-1. Example of 3PL Item Characteristic Curve
2-2. Example of Test Information Function
2-3. Example of Item Response Surface for the M2PL Model
2-4. Category Response Surfaces for an Item Calibrated with the MGR Model
2-5. Example of a Contour Plot for an Item Response Surface
2-6. Vector Representation of Two-Dimensional Items. 29
A1. Weighted Total Score Distributions for Pseudo Old and New Form Examinees for Chemistry using the Single Population Dataset. 156
A2. Weighted Total Score Distributions for Pseudo Old and New Form Examinees for Spanish using the Single Population Dataset. 157
A3. Weighted Total Score Distributions for Pseudo Old and New Form Examinees for English using the Single Population Dataset. 157
A4. Weighted Total Score Distributions for Pseudo Old and New Form Examinees for Chemistry using the Matched Samples Dataset. 158
A5. Weighted Total Score Distributions for Pseudo Old and New Form Examinees for Spanish using the Matched Samples Dataset. 158
A6. Weighted Total Score Distributions for Pseudo Old and New Form Examinees for English using the Matched Samples Dataset. 159
A7. Scree Plot of Eigenvalues for Chemistry using the Matched Samples Dataset. 159
A8. Scree Plot of Eigenvalues for Spanish using the Matched Samples Dataset. 160
A9. Scree Plot of Eigenvalues for English using the Matched Samples Dataset. 160
A10. Chemistry New Form Observed and Fitted Distributions for Matched Samples Data. 161
A11. Chemistry Old Form Observed and Fitted Distributions for Matched Samples Data. 162
A12. Spanish New Form Observed and Fitted Distributions for Matched Samples Data. 163
A13. Spanish Old Form Observed and Fitted Distributions for Matched Samples Data. 164
A14. English New Form Observed and Fitted Distributions for Matched Samples Data. 165
A15. English Old Form Observed and Fitted Distributions for Matched Samples Data. 166
A16. Chemistry New Form Observed and Fitted Distributions for Single Population Data. 167
A17. Chemistry Old Form Observed and Fitted Distributions for Single Population Data. 168
A18. Spanish New Form Observed and Fitted Distributions for Single Population Data. 169
A19. Spanish Old Form Observed and Fitted Distributions for Single Population Data. 170
A20. English New Form Observed and Fitted Distributions for Single Population Data. 171
A21. English Old Form Observed and Fitted Distributions for Single Population Data. 172
A22. Traditional Equipercentile and UIRT Equating Relationships and Differences for Chemistry Raw Scores using the Single Population Datasets. 173
A23. Bifactor Equating Relationships and Differences for Chemistry Raw Scores using the Single Population Datasets. 173
A24. Full MIRT Two-Dimensional Equating Relationships and Differences for Chemistry Raw Scores using the Single Population Datasets. 174
A25. Full MIRT Four-Dimensional Equating Relationships and Differences for Chemistry Raw Scores using the Single Population Datasets. 174
A26. Chemistry Raw Score Equating Relationships and Differences for all Methods using Single Population Datasets and Prior Setting (II). 175
A27. Traditional Equipercentile and UIRT Equating Relationships and Differences for Chemistry Scale Scores using the Single Population Datasets. 175
A28. Bifactor Equating Relationships and Differences for Chemistry Scale Scores using the Single Population Datasets. 176
A29. Full MIRT Two-Dimensional Equating Relationships and Differences for Chemistry Scale Scores using the Single Population Datasets. 176
A30. Full MIRT Four-Dimensional Equating Relationships and Differences for Chemistry Scale Scores using the Single Population Datasets. 177
A31. Chemistry Scale Score Equating Relationships and Differences for all Methods using the Single Population Datasets and Prior Setting (II). 177
A32. Traditional Equipercentile and UIRT Equating Relationships and Differences for Spanish Raw Scores using the Single Population Datasets. 178
A33. Bifactor Equating Relationships and Differences for Spanish Raw Scores using the Single Population Datasets. 178
A34. Full MIRT Two-Dimensional Equating Relationships and Differences for Spanish Raw Scores using the Single Population Datasets. 179
A35. Full MIRT Four-Dimensional Equating Relationships and Differences for Spanish Raw Scores using the Single Population Datasets. 179
A36. Spanish Raw Score Equating Relationships and Differences for all Methods using the Single Population Datasets and Prior Setting (I). 180
A37. Traditional Equipercentile and UIRT Equating Relationships and Differences for Spanish Scale Scores using the Single Population Datasets. 180
A38. Bifactor Equating Relationships and Differences for Spanish Scale Scores using the Single Population Datasets. 181
A39. Full MIRT Two-Dimensional Equating Relationships and Differences for Spanish Scale Scores using the Single Population Datasets. 181
A40. Full MIRT Four-Dimensional Equating Relationships and Differences for Spanish Scale Scores using the Single Population Datasets. 182
A41. Spanish Scale Score Equating Relationships and Differences for all Methods using the Single Population Datasets and Prior Setting (I). 182
A42. Traditional Equipercentile and UIRT Equating Relationships and Differences for English Raw Scores using the Single Population Datasets. 183
A43. Bifactor Equating Relationships and Differences for English Raw Scores using the Single Population Datasets. 183
A44. Full MIRT Two-Dimensional Equating Relationships and Differences for English Raw Scores using the Single Population Datasets. 184
A45. English Raw Score Equating Relationships and Differences using the Single Population Datasets and Prior Setting (I). 184
A46. Traditional Equipercentile and UIRT Equating Relationships and Differences for English Scale Scores using the Single Population Datasets. 185
A47. Bifactor Equating Relationships and Differences for English Scale Scores using the Single Population Datasets. 185
A48. Full MIRT Two-Dimensional Equating Relationships and Differences for English Scale Scores using the Single Population Datasets. 186
A49. English Scale Score Equating Relationships and Differences for all Methods using the Single Population Datasets and Prior Setting (I). 186
A50. Chemistry UIRT Observed Score Equating Relationships for Raw Scores Using the Matched Samples Datasets. 187
A51. Chemistry Bifactor Observed Score Equating Relationships for Raw Scores Using the Matched Samples Datasets. 187
A52. Chemistry Full MIRT Two-Dimensional Observed Score Equating Relationships for Raw Scores Using the Matched Samples Datasets. 188
A53. Chemistry Full MIRT Four-Dimensional Observed Score Equating Relationships for Raw Scores Using the Matched Samples Datasets. 188
A54. Chemistry Equating Relationships for Raw Scores Using the Matched Samples Datasets and Prior Setting (II). 189
A55. Differences Between UIRT Equivalents and Equipercentile Equivalents for Chemistry Raw Scores Using the Matched Samples Datasets. 189
A56. Differences Between Bifactor Equivalents and Equipercentile Equivalents for Chemistry Raw Scores Using the Matched Samples Datasets. 190
A57. Differences Between Full MIRT Two-Dimensional Equivalents and Equipercentile Equivalents for Chemistry Raw Scores Using the Matched Samples Datasets. 190
A58. Differences Between Full MIRT Four-Dimensional Equivalents and Equipercentile Equivalents for Chemistry Raw Scores Using the Matched Samples Datasets. 191
A59. Differences Between (M)IRT Equivalents (using Prior Setting II) and Equipercentile Equivalents for Chemistry Raw Scores for the Matched Samples Datasets. 191
A60. Differences Between UIRT Equivalents and Equipercentile Equivalents for Chemistry Scale Scores Using the Matched Samples Datasets. 192
A61. Differences Between Bifactor Equivalents and Equipercentile Equivalents for Chemistry Scale Scores Using the Matched Samples Datasets. 192
A62. Differences Between Full MIRT Two-Dimensional Equivalents and Equipercentile Equivalents for Chemistry Scale Scores Using the Matched Samples Datasets. 193
A63. Differences Between Full MIRT Four-Dimensional Equivalents and Equipercentile Equivalents for Chemistry Scale Scores Using the Matched Samples Datasets. 193
A64. Differences Between (M)IRT Equivalents (using Prior Setting II) and Equipercentile Equivalents for Chemistry Scale Scores for the Matched Samples Datasets. 194
A65. UIRT Observed Score Equating Relationships for Spanish Raw Scores Using the Matched Samples Datasets. 194
A66. Bifactor Observed Score Equating Relationships for Spanish Raw Scores Using the Matched Samples Datasets. 195
A67. Full MIRT Two-Dimensional Observed Score Equating Relationships for Spanish Raw Scores Using the Matched Samples Datasets. 195
A68. Full MIRT Four-Dimensional Observed Score Equating Relationships for Spanish Raw Scores Using the Matched Samples Datasets. 196
A69. Spanish Equating Relationships for Raw Scores Using the Matched Samples Datasets with Prior Setting (I). 196
A70. Differences Between UIRT Equivalents and Equipercentile Equivalents for Spanish Raw Scores Using the Matched Samples Datasets. 197
A71. Differences Between Bifactor Equivalents and Equipercentile Equivalents for Spanish Raw Scores Using the Matched Samples Datasets. 197
A72. Differences Between Full MIRT Two-Dimensional and Equipercentile Equivalents for Spanish Raw Scores Using the Matched Samples Datasets. 198
A73. Differences Between Full MIRT Four-Dimensional and Equipercentile Equivalents for Spanish Raw Scores Using the Matched Samples Datasets. 198
A74. Differences Between (M)IRT Equivalents (using Prior Setting I) and Equipercentile Equivalents for Spanish Raw Scores Using the Matched Samples Datasets. 199
A75. Differences Between UIRT and Equipercentile Equivalents for Spanish Scale Scores Using the Matched Samples Datasets. 199
A76. Differences Between Bifactor and Equipercentile Equivalents for Spanish Scale Scores Using the Matched Samples Datasets. 200
A77. Differences Between Full MIRT Two-Dimensional and Equipercentile Equivalents for Spanish Scale Scores Using the Matched Samples Datasets. 200
A78. Differences Between Full MIRT Four-Dimensional and Equipercentile Equivalents for Spanish Scale Scores Using the Matched Samples Datasets. 201
A79. Differences Between (M)IRT Equivalents (using Prior Setting I) and Equipercentile Equivalents for Spanish Scale Scores Using the Matched Samples Datasets. 201
A80. Traditional Equipercentile and UIRT Observed Score Equating Relationships for English Raw Scores Using the Matched Samples Datasets. 202
A81. Bifactor Observed Score Equating Relationships for English Raw Scores Using the Matched Samples Datasets. 202
A82. Full MIRT Two-Dimensional Observed Score Equating Relationships for English Raw Scores Using the Matched Samples Datasets. 203
A83. Equating Relationships for English Raw Scores Using All Methods with the Matched Samples Datasets and Prior Setting (I). 203
A84. Differences Between UIRT and Equipercentile Equivalents for English Raw Scores Using the Matched Samples Datasets. 204
A85. Differences Between Bifactor and Equipercentile Equivalents for English Raw Scores Using the Matched Samples Datasets. 204
A86. Differences Between Full MIRT Two-Dimensional and Equipercentile Equivalents for English Raw Scores Using the Matched Samples Datasets. 205
A87. Differences Between (M)IRT Equivalents (using Prior Setting I) and Equipercentile Equivalents for English Raw Scores for the Matched Samples Datasets. 205
A88. Differences Between UIRT and Equipercentile Equivalents for English Scale Scores Using the Matched Samples Datasets. 206
A89. Differences Between Bifactor and Equipercentile Equivalents for English Scale Scores Using the Matched Samples Datasets. 206
A90. Differences Between Full MIRT Two-Dimensional and Equipercentile Equivalents for English Scale Scores Using the Matched Samples Datasets. 207
A91. Differences Between (M)IRT Equivalents (using Prior Setting I) and Equipercentile Equivalents for English Scale Scores for the Matched Samples Datasets. 207

LIST OF SYMBOLS

a: IRT discrimination or slope parameter
A: Scaling constant
b: IRT difficulty or location parameter
B: Scaling constant
c: IRT pseudo-guessing parameter
d: MIRT item intercept parameter
D: Scaling constant equal to 1.7
e: Exponential
h: Threshold boundary parameter of polytomous item
i: Index denoting specific item (i = 1,..., n)
j: Index denoting examinee (j = 1,..., N)
k: Index denoting item response category (k = 1,..., K)
K: Number of response categories for a polytomous item
m: Index denoting latent ability dimension
n: Number of items
N: Number of examinees
N_parm: Number of estimated model parameters
P: Probability of earning a particular score
P*: Cumulative category response probability in polytomous model
q: Population density function
r: Index for calculating observed score distribution in (M)IRT
S: Total number of raw score points on the New Form
s_a: Number of raw score points that convert to the AP grade a
T: Number-correct true score
U: Item response
v: Indexes common items (v = 1,..., V)
W: Scoring function of polytomous item
x: Observed score
Z: Maximum raw composite score for a test
MC: Multiple-choice
FR: Free response
IRT: Item Response Theory
MIRT: Multidimensional Item Response Theory
UIRT: Unidimensional Item Response Theory
1PL: One-parameter logistic IRT model
2PL: Two-parameter logistic IRT model
3PL: Three-parameter logistic IRT model
GR: Graded response model
GPC: Generalized partial credit model
IRF: Item response function
ICC: Item characteristic curve
TCF: Test characteristic function
TCC: Test characteristic curve
TIF: Test information function
IRS: Item response surface
CRS: Category response surface
TCS: Test characteristic surface
β: MIRT item threshold parameter
τ: IRT definition of true score
θ: Latent ability scalar
θ (bold): Vector of latent abilities
a (bold): Vector of item discriminations
I: Identity matrix
Rot: Orthogonal rotation matrix
υ: Vector of person coordinates in transformed coordinate system
γ: Vector of item parameters
α: Vector of angles
C: Matrix of scaling constants
MDIFF: MIRT item difficulty index
MDISC: MIRT item discrimination index
ES: Effect size
x̄: Sample mean
σ²: Sample variance
ρ_TT: Disattenuated correlation
ρ_S1S2: Observed correlation between section scores
ρ_SS: Coefficient alpha reliability index for a test section
AIC: Akaike Information Criteria
BIC: Bayesian Information Criteria
S-Χ²: Item fit statistic
χ²: Item fit statistic
ψ: Multivariate ability density
eq(x): Equated observed score
UVN: Univariate normal distribution
MVN: Multivariate normal distribution

CHAPTER I
INTRODUCTION

For many testing programs it is often necessary to administer alternate forms of a test due to circumstances such as multiple testing dates and threats to test security. It is desirable that those forms can be given interchangeably across situations, which implies that examinees' scores should not be dependent on the particular form taken. In an attempt to achieve this quality, test forms are typically built to the same content and statistical specifications, but even then, there are likely to be differences in difficulty between forms. Equating is a statistical procedure used to adjust for these differences in difficulty so that scores on alternate forms can be used interchangeably (Kolen & Brennan, 2004). The process of equating requires that several decisions be made in regard to the data collection design, the method or type of equating, and the statistical estimation methods.

Equating Designs

The random groups (RG) design is used in this dissertation and describes the situation where examinees are randomly assigned to alternate forms, usually through a spiraling process. The use of this design is advantageous in that the groups taking each form are considered randomly equivalent in ability levels (Kolen & Brennan, 2004). Therefore, differences in group-level performance can be attributed to differences in form difficulties, which can then be accounted for during the equating process. The major drawback of the RG design is that multiple forms are required to be administered simultaneously, which may be too great a burden on test developers. If spiraling proves effective at producing groups with similar abilities, then another benefit is that the statistical estimation methods are simpler than if the common-item nonequivalent groups (CINEG) design had been used.

When more than one form cannot be administered on a single testing date, the use of the CINEG design may be the best option.

With this design, forms are administered to separate groups of examinees, usually on separate testing dates. Each form contains a set of items that appear on both forms, which are referred to as common items. These common items are ideally designed to reflect a mini-version of the full-length test with regard to content and statistical specifications. Common items can be classified as internal or external depending on whether scores on them contribute to an examinee's operational test score. With external anchors, one advantage of the CINEG design is that it is possible to disclose test items following test administration, which is required in some instances (Kolen & Brennan, 2004). The statistical procedures required by this design are more complicated in comparison to the random groups design because groups can no longer be considered randomly equivalent; therefore, differences in groups and forms need to be disentangled, which requires strong statistical assumptions to be made. The CINEG design also requires the additional step of scale linking in order to put item and person parameter estimates on the same scale before equating can be done. This added step is not required with the RG design or the single group (SG) design, which will be described next.

The SG design demands the most effort from examinees by requiring them to take both forms of the test. The SG design is usually carried out with counterbalancing in order to remove order effects such as fatigue or practice. One benefit of the SG design with counterbalancing is the smaller sample size needed in comparison to the RG design, because each examinee serves as his or her own control (Kolen & Brennan, 2004). However, several drawbacks exist, such as the threat of differential order effects and the increased testing time required to administer two full-length forms of a test.

After the equating design has been established and data collected, there are several considerations that need to be taken into account when determining which equating method may be best to use.

The choice of which method to use is not an exact science, but it does depend on the extent to which the assumptions inherent to each method are likely to hold, as well as on other practical considerations. In the next section, several equating methods, along with their assumptions and implications, are discussed.

Traditional Equating Methods

Equating methods typically differ in regard to the definition used to describe the equating relationship and the assumptions made within it, both implicit and explicit. Traditional methods are those such as mean, linear, and equipercentile methods that mostly do not depend on a strong true score model and define the equating relationship for observed score distributions. The exception is the Levine methods, which are linear and require the use of a true-score model. Traditional methods make less strict assumptions than IRT methods because a psychometric model is not used to predict the relationship between item and person characteristics.

In mean equating, the equated score distributions have equal means: a constant is added or subtracted along the entire score scale to adjust for differences in difficulty. The linear method is more general than the mean method, as differences in difficulty are adjusted for by varying degrees along the score scale. After linear equating under the RG design, the equated score distributions have equal means and standard deviations; under the CINEG design, this holds only for the synthetic population. The linear methods with the CINEG design often involve statistical assumptions that are untestable. For example, for the CINEG design the Tucker linear method assumes the regression of New Form score (X) on common item score (V) in the Old Form group is equal to that in the New Form group (Kolen & Brennan, 2004). Since the Old Form group is never administered the New Form, this assumption is impossible to test, which is a disadvantage. Other assumptions for the Tucker linear method are testable, such as that the regression of X on V is linear.
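To make the mean and linear adjustments concrete, the following is a minimal sketch under the random groups design; the two binomial "forms" and all parameter values are hypothetical and are not taken from this dissertation.

```python
import numpy as np

def linear_equating(x_scores, y_scores):
    """Linear equating under the random groups design: map a New Form (X) score
    so that the equated scores match the Old Form (Y) mean and standard deviation."""
    mu_x, sigma_x = np.mean(x_scores), np.std(x_scores)
    mu_y, sigma_y = np.mean(y_scores), np.std(y_scores)
    return lambda x: sigma_y / sigma_x * (x - mu_x) + mu_y

def mean_equating(x_scores, y_scores):
    """Mean equating: add a single constant along the entire score scale."""
    shift = np.mean(y_scores) - np.mean(x_scores)
    return lambda x: x + shift

# Hypothetical raw-score samples for two randomly equivalent groups.
rng = np.random.default_rng(0)
new_form = rng.binomial(n=60, p=0.55, size=2000)   # Form X (slightly harder)
old_form = rng.binomial(n=60, p=0.60, size=2000)   # Form Y

l_y = linear_equating(new_form, old_form)
print("Form X score 35 maps to a Form Y equivalent of about", round(float(l_y(35)), 2))
```

Because the groups are randomly equivalent, the difference between the two observed score distributions is attributed entirely to form difficulty, so no anchor items or synthetic-population assumptions are needed in this sketch.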

The equipercentile methods are even more general than both the mean and linear methods and, unlike the other two, specify a curvilinear equating relationship. Equipercentile methods, such as the frequency estimation or chained methods, result in the equated score distribution for the New Form (Form X) being similar in shape to the score distribution for the Old Form (Form Y). This holds for the RG design and for the synthetic population associated with the CINEG design. Many times, the equating relationship estimated by equipercentile methods appears bumpy due to sampling error (Kolen & Brennan, 2004). As a result, smoothing techniques are used to reduce random error while not introducing additional sources of systematic error.

The majority of traditional equating methods do not impose a psychometric model on the data, with the exception of the Levine linear observed score and true score methods under the CINEG design, which assume that error variances are proportional to effective test lengths under the classical congeneric model. Not requiring the added step of model checking is usually taken as one advantage of traditional methods over IRT methods. However, under some circumstances, such as equating to an item pool or when preequating is needed, traditional methods are less flexible and offer less utility than IRT methods (Kolen & Brennan, 2004).

Item Response Theory

In addition to equating applications, IRT is useful in many other situations such as test development, computerized adaptive testing, differential item functioning, and test scaling (Kolen & Brennan, 2004). At the foundation of IRT is a probabilistic function, referred to as an item response function (IRF), that specifies the relationship between item characteristics (i.e., item parameters) and the probability that a person will respond correctly to that item. Item responses are dependent on the latent ability being measured by the item (Lord, 1980). IRT models exist both for multiple-choice type items that are scored dichotomously (Birnbaum, 1968; Lord, 1980; Rasch, 1960) and for free response items, also referred to as polytomous items, which have more than two score points (Bock, 1972; Muraki, 1992; Samejima, 1969).

The use of IRT requires that certain assumptions be met in order for results to be accurate and useful.

Unlike classical test theory, which makes assumptions at the test level, IRT makes assumptions at the item level. These assumptions relate to the functional form of the mathematical model, unidimensionality (for unidimensional IRT models), and local or conditional independence. The first assumption states that the relationship between person responses and item characteristics is modeled appropriately by the parameters in the IRF. The second assumption, unidimensionality, posits that item responses are governed by a single underlying latent ability (de Ayala, 2009); this also implies that all items measure the same single latent ability. The third assumption, local independence, follows naturally from the unidimensionality assumption (Lord, 1980) and implies that item responses are independent of responses to other items and depend only on the level of the latent ability possessed by the examinee. Local independence is often violated when items are clustered within reading passages, whereas the unidimensionality assumption is known to be suspect in scenarios with mathematical word problems, which require both reading and math skills in order for examinees to have a high probability of providing a correct answer. The unidimensionality assumption can also be violated more generally in situations dealing with mixed-format exams or innovative item types, as examples.

IRT Equating Methods

The use of IRT equating methods requires strong assumptions for all equating designs (Kolen & Brennan, 2004). In general, equating results have been found to be robust to moderate (but not severe) violations of the unidimensionality assumption (Bolt, 1999; Camilli et al., 1995; Cook et al., 1985; De Champlain, 1996; Dorans & Kingston, 1985; Yen, 1984). However, in any situation where violations are plausible, the robustness of parameter estimates and equating results should be investigated prior to IRT equating.

IRT equating methods have been developed for observed scores and true scores. The process of true score equating involves equating true or latent trait scores through use of the test characteristic curve (TCC). True score equating involves three general steps: 1) specify a true score on the New Form, 2) identify the latent trait score (θ) that corresponds to that true score, and 3) find the corresponding true score on the Old Form that corresponds to that θ (Kolen & Brennan, 2004). The second step usually involves the use of an iterative numerical procedure called the Newton-Raphson method.
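The following is a minimal sketch of these three steps for two hypothetical 10-item forms calibrated with the 2PL model; the item parameters are invented for illustration and the routine is not the implementation used in this dissertation. Step 2 is carried out with a simple Newton-Raphson iteration on the test characteristic function.

```python
import numpy as np

def p_2pl(theta, a, b, D=1.7):
    """2PL item response function for arrays of item parameters a and b."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def tcc(theta, a, b):
    """Test characteristic function: expected number-correct (true) score at theta."""
    return np.sum(p_2pl(theta, a, b))

def theta_for_true_score(tau, a, b, theta0=0.0, tol=1e-8, max_iter=100):
    """Step 2: Newton-Raphson solution of tcc(theta) = tau."""
    theta = theta0
    for _ in range(max_iter):
        p = p_2pl(theta, a, b)
        f = np.sum(p) - tau                       # residual
        fprime = np.sum(1.7 * a * p * (1 - p))    # derivative of the TCC
        step = f / fprime
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Hypothetical 2PL parameters for a 10-item New Form (X) and Old Form (Y).
a_x, b_x = np.full(10, 1.0), np.linspace(-2.0, 2.0, 10)
a_y, b_y = np.full(10, 1.1), np.linspace(-1.8, 2.2, 10)

tau_x = 6.0                                       # Step 1: a New Form true score
theta = theta_for_true_score(tau_x, a_x, b_x)     # Step 2: corresponding theta
tau_y = tcc(theta, a_y, b_y)                      # Step 3: Old Form equivalent
print(f"New Form true score {tau_x} -> theta {theta:.3f} -> Old Form true score {tau_y:.3f}")
```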

In observed score IRT equating, observed score distributions on both forms are first estimated by using an IRT model, or a combination of models in the case of mixed-format tests (Kolen & Brennan, 2004). Then, traditional equipercentile methods are used to equate the two estimated observed score distributions. IRT observed score methods are computationally more feasible than true score methods in situations where the unidimensionality assumption has been violated and multidimensional IRT (MIRT) models are needed, which will be discussed later.
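Continuing the hypothetical 2PL example, the sketch below illustrates the two stages just described: a recursive formula (often called the Lord-Wingersky recursion) builds each form's conditional number-correct distribution, the conditional distributions are averaged over a discretized normal ability density, and the two fitted marginal distributions are then linked with a simplified, unsmoothed equipercentile conversion. It illustrates the general logic only, not the mixed-format procedure developed later in this dissertation.

```python
import numpy as np

def p_2pl(theta, a, b, D=1.7):
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def conditional_score_dist(theta, a, b):
    """Distribution of number-correct scores given theta, built item by item."""
    dist = np.array([1.0])
    for p in p_2pl(theta, a, b):
        dist = np.append(dist * (1 - p), 0.0) + np.append(0.0, dist * p)
    return dist

def marginal_score_dist(a, b, nodes, weights):
    """Average the conditional distributions over a discrete ability density."""
    return sum(w * conditional_score_dist(t, a, b) for t, w in zip(nodes, weights))

def equipercentile(fx, fy):
    """Simplified equipercentile conversion: match percentile ranks of the two
    fitted distributions by linear interpolation (no smoothing)."""
    pr_x = np.cumsum(fx) - fx / 2.0
    pr_y = np.cumsum(fy) - fy / 2.0
    return np.interp(pr_x, pr_y, np.arange(len(fy)))

# Hypothetical item parameters and a discretized standard normal ability grid.
a_x, b_x = np.full(10, 1.0), np.linspace(-2.0, 2.0, 10)
a_y, b_y = np.full(10, 1.1), np.linspace(-1.8, 2.2, 10)
nodes = np.linspace(-4, 4, 41)
weights = np.exp(-0.5 * nodes ** 2)
weights /= weights.sum()

fx = marginal_score_dist(a_x, b_x, nodes, weights)
fy = marginal_score_dist(a_y, b_y, nodes, weights)
print(np.round(equipercentile(fx, fy), 2))   # Form Y equivalents of scores 0..10
```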

Multidimensional Item Response Theory

According to Reckase (2009), it is hard to imagine a situation where all items on a test are pure measures of just one latent ability, as is assumed by unidimensional IRT (UIRT) models. Multidimensional IRT (MIRT) models are more flexible in that they allow items on a test to measure several latent traits, which likely resembles the underlying dimensional structure of tests more accurately, especially for mixed-format tests. Even though UIRT equating methods are generally robust to small violations of unidimensionality, equating should be conducted in such a manner that the effects of assumption violations are minimized (Kolen & Brennan, 2004). The use of MIRT models over UIRT models is one way in which assumption violations can be lessened if evidence suggests the test is not unidimensional. Similar to UIRT models, MIRT models make assumptions concerning the functional form of the model, local independence, and dimensionality (de Ayala, 2009). In the MIRT framework, the local independence assumption is conditional on a vector of latent abilities rather than just one. Dimensionality assessments also should support the number of latent traits specified by the MIRT model.

Several MIRT models have been proposed in the literature, and many were developed as extensions of their unidimensional counterparts. MIRT models exist for both dichotomously (McKinley & Reckase, 1982; Mulaik, 1972; Reckase, 1972; Sympson, 1978; Whitely, 1980) and polytomously scored items (de la Torre, 2008; Muraki & Bejar, 1995; Muraki & Carlson, 1995; Wang et al., 2004). MIRT models can generally be grouped into categories using a two-way classification scheme. First, models can be classified as either compensatory or non-compensatory. Compensatory models are additive and allow skill on one trait to make up for, or compensate for, deficiencies in skill level on another trait. Two examples of compensatory MIRT models for dichotomous items are the multidimensional two- and three-parameter logistic (M-2PL and M-3PL) models (McKinley & Reckase, 1983; Reckase, 1985), although other dichotomous MIRT models exist as well. Some examples of polytomous compensatory MIRT models include the generalized two-parameter partial credit model (M-2PPC; Yao & Schwarz, 2006) and the multidimensional graded response model (M-GR; Muraki & Carlson, 1993). Non-compensatory models are multiplicative and require that skill levels on both traits be sufficient in order for an examinee to earn a higher score.

Secondly, MIRT models can be classified according to the dimensional structure of the items. Complex-structure MIRT models are used with items that are hypothesized to measure more than one trait simultaneously. Complex-structure items discriminate on more than one latent dimension and therefore have two or more non-zero discrimination (a) parameters. The M-3PL (Reckase, 1985), M-2PPC (Yao & Schwarz, 2006), and M-GR (Muraki & Carlson, 1993) are examples of complex MIRT models. Other models, such as the bi-factor model (Gibbons & Hedeker, 1992), can be considered constrained MIRT models because, while items are allowed to measure more than one dimension, they cannot measure more than two. Under the bi-factor model, items are specified to measure an overall general factor and one second-order factor.

The second-order factor could represent item type, where multiple-choice (MC) items would then discriminate along the general trait and the MC-specific trait, but not on the trait dimension specific to constructed response (CR) items. The testlet response theory model (TRT; Bradlow, Wainer, & Wang, 1999) is a constrained version of the bi-factor model. MIRT models can also specify a simple structure or approximate simple structure relationship, in which items discriminate along only one dimension, or very highly on one dimension and negligibly on the other dimension(s), respectively. The latent person space is still multidimensional, given that the items do not all measure along the same dimensions.

MIRT developed out of the need for more complex item response models. With the introduction of these new models came the need to develop new linking and equating methodology that could handle these increased complexities. Progress in the area of MIRT equating has been slow, and according to Reckase (2009), there has been "little research that shows that the results meet the theoretical requirements for equating" (p. 309).

MIRT Equating Methods

The goal of MIRT equating is the same as with UIRT equating, except that the latent person space is multidimensional rather than unidimensional. This compounds the task of scale linking, or placing results from alternate forms on the same coordinate system, especially as the dimensionality increases. Scale linking procedures for MIRT have been developed to put item and person parameter estimates on the same scale (Hirsch, 1989; Li & Lissitz, 2000; Oshima, Davey, & Lee, 2000; Thompson, Nering, & Davey, 1997; Yao & Boughton, 2009). In addition to adjusting for the differences in origin and unit of measurement handled in UIRT scale linking, differences in correlation and rotation also need to be adjusted (Reckase, 2009; Thompson et al., 1997). It should be noted that within the MIRT framework, origin and unit of measurement are commonly referred to as translation and dilation, respectively. The literature on MIRT linking procedures is more expansive than the literature on MIRT equating procedures.
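As a rough illustration of what these adjustments involve, the sketch below applies a translation, dilation, and orthogonal rotation to hypothetical two-dimensional ability vectors, together with the compensating transformation of compensatory M-2PL item parameters, so that the modeled response probabilities are unchanged. It is a toy example of the idea, not one of the published linking procedures cited above.

```python
import numpy as np

def m2pl_prob(theta, a, d, D=1.7):
    """Compensatory M-2PL: response probability from the linear combination a.theta + d."""
    return 1.0 / (1.0 + np.exp(-D * (theta @ a + d)))

# Hypothetical two-dimensional ability vectors and item parameters.
rng = np.random.default_rng(1)
theta = rng.normal(size=(5, 2))          # 5 examinees, 2 dimensions
a = np.array([0.8, 1.2])                 # discrimination vector
d = -0.5                                 # intercept

# A linking transformation of the coordinate system:
angle = np.radians(20)
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])   # rotation
dil = np.diag([1.3, 0.9])                           # dilation (unit change per axis)
trans = np.array([0.4, -0.2])                       # translation (origin shift)

A = rot @ dil                        # combined rotation/dilation
theta_new = theta @ A.T + trans      # person coordinates on the new axes

# Compensating item-parameter transformation keeps probabilities invariant.
a_new = np.linalg.inv(A).T @ a
d_new = d - a_new @ trans

print(np.allclose(m2pl_prob(theta, a, d), m2pl_prob(theta_new, a_new, d_new)))  # True
```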

Similar to UIRT equating methods, the data collection designs for MIRT equating include the single group, random groups, and common-item nonequivalent groups designs. A prerequisite to equating alternate test forms is that they be parallel (Davey, Oshima, & Lee, 1996). In the MIRT context, parallelism in regard to content and statistical specifications may be more difficult to attain, because more parameters need to be taken into account. Another prerequisite for MIRT equating is that item and ability estimates be on the same scale. The real datasets used in this dissertation were originally collected using the CINEG design, but were manipulated to approximate the RG design so that scale linking would not be required.

Observed score MIRT equating is conducted in a similar manner as UIRT observed score equating, except that observed score distributions are conditional on a vector of latent abilities rather than a single ability. Furthermore, the ability density used to obtain the marginal observed score distributions is multivariate instead of univariate (Brossman & Lee, 2013). True score MIRT equating methods are much more complicated compared to the UIRT case, because the TCC is replaced with a test characteristic surface (TCS) and, instead of a one-to-one correspondence between true score and latent ability, there is a one-to-many correspondence.

The first full MIRT equating procedure was illustrated by Brossman and Lee (2013), but for dichotomously-scored tests only. Specifically, they provided examples of three equating methods: 1) full MIRT observed score equating, 2) unidimensional approximation to MIRT observed score equating, and 3) unidimensional approximation to MIRT true score equating. Both UIRT approximation methods were extensions of earlier work by Zhang and colleagues (Zhang, 1996; Zhang & Stout, 1999; Zhang & Wang, 1998). Equating methods for tests comprised solely of testlets have been presented as well, using Bradlow, Wainer, and Wang's (1999) testlet response theory model and Gibbons and Hedeker's (1992) bi-factor model (He et al., 2011; He et al., 2012).

However, all of the MIRT equating studies listed thus far have been limited to tests comprised solely of dichotomously-scored items. In 2012, a study by Lee and Brossman illustrated equating methods for mixed-format tests under a simple structure MIRT framework. More recently, a study by Lee and Lee (2014) demonstrated bi-factor MIRT observed score equating methods for mixed-format tests. To the best of this author's knowledge, the two studies mentioned here are the only ones that have examined MIRT equating methods for mixed-format tests.

Research Objectives

It has been suggested that when a test consists of both multiple-choice (MC) and free response (FR) items, the different item types measure somewhat different traits and thereby violate the IRT unidimensionality assumption (Thissen, Wainer, & Wang, 1994; Wainer & Thissen, 1993). As a result, observed score equating methods using the bifactor and simple structure MIRT models, which take item format into account, have been proposed (Lee & Brossman, 2012; Lee & Lee, in press). However, these studies are few and far between, and more research is warranted to address the complications associated with MIRT equating. In addition to multidimensionality arising from the interaction of persons and item format effects, multidimensionality can also arise from the interaction of persons and content subdomains. According to Reckase (2009), examinees are likely to rely on several abilities when responding to test items, and the tasks required by each item are likely to call on several skills and abilities in order for the correct answer to be chosen. A study done in the field of cognitive science (e.g., Frederikson, Glaser, Lesgold, & Shafto, 1990) illustrated that multiple skills are required to correctly answer a single test item. This requirement of multiple abilities is thought to be especially applicable to measures of achievement in the natural sciences (Reckase, 2009).

Therefore, the main objective of this dissertation is to expand upon the MIRT observed score equating literature in the context of mixed-format tests. Specifically, the main objectives are to:

1. Present observed score equating methods for mixed-format tests using a full MIRT modeling framework.

2. Compare the differences in equating results from using full MIRT, Bifactor MIRT, UIRT, and traditional equipercentile methods.

3. Compare differences in equating results for the full MIRT method when the correlations between latent traits are specified as zero versus when they are freely estimated.

4. Examine whether dimensionality due to item format versus dimensionality due to content subdomain influences the relationships among the four equating methods in a similar manner.

5. Determine how the severity of the violation of unidimensionality, as measured by a combination of dimensionality assessment procedures, affects equating results.

CHAPTER II
LITERATURE REVIEW

Chapter Two provides background information and a theoretical foundation for the topics discussed in this dissertation. The first two sections provide information on unidimensional and multidimensional IRT, such as model specifications, assumptions, applications, and the similarities between the IRT and MIRT frameworks. The third section provides an overview of several procedures used to assess data dimensionality. The following two sections give background information concerning UIRT and MIRT scaling and linking methods under various data collection designs. The last two sections offer an overview of the published literature related to UIRT and MIRT equating methods.

Item Response Theory

Item response theory (IRT) is a useful psychometric tool because it allows for predictions to be made about how examinees will answer test items without actually having to administer them (Lord, 1980). Predictions take the form of probability statements which are governed by a set of item and person characteristics, or parameters. The parameters used to describe a single item vary depending on the particular IRT model, but usually include a parameter for item location (b), item discrimination (a), and sometimes pseudo-guessing (c). The person parameter (θ) represents a latent trait, skill, or ability that the items intend to measure, and is expressed on the same scale (i.e., the θ scale) as the items. Using IRT, the probability that an examinee with a certain ability level will correctly answer an item with particular characteristics can be expressed with an item response function (IRF).

Unidimensional IRT Models

The IRFs for the two-parameter logistic (2PL) and three-parameter logistic (3PL) models (Birnbaum, 1968), respectively, can be expressed as

$$P_i(\theta_j) = \frac{e^{1.7a_i(\theta_j - b_i)}}{1 + e^{1.7a_i(\theta_j - b_i)}} \qquad (2.1a)$$

and

$$P_i(\theta_j) = c_i + \frac{1 - c_i}{1 + e^{-1.7a_i(\theta_j - b_i)}}. \qquad (2.1b)$$

Here, $a_i$ represents the discrimination of item i, $b_i$ is the item's location, $c_i$ is the item's pseudo-guessing parameter, $\theta_j$ is the latent ability of person j, and 1.7 is a scaling constant (i.e., D) used to make results comparable to the normal ogive metric. The c-parameter represents the probability of a correct response for an examinee with very low ability (i.e., $\theta = -\infty$) (Lord, 1980). The b-parameter represents the item's difficulty and corresponds to the point on the theta scale where the probability of a correct response is halfway between c and 1.0. The a-parameter reflects the degree to which an item can discriminate among examinees at $b_i$, with higher positive values indicating greater discriminating power. The 2PL and 1PL models are special cases of the 3PL model, and all are used with dichotomous or multiple-choice (MC) test items. The 2PL model specifies the pseudo-guessing parameter to be 0, while the 1PL makes the additional constraint of equal discriminations for all items.

IRT models have also been developed for free-response (FR) items that have more than two score categories. Polytomously-scored items have the benefit of providing more information about an examinee's ability level in comparison to dichotomously-scored items, but at the cost of requiring greater testing time. This gain in information, along with other considerations such as improved score validity, has led many state and national testing programs to use mixed-format tests that consist of both MC and FR item types. Two of the most commonly used polytomous IRT models are Samejima's (1969) graded response (GR) model and Muraki's (1992) generalized partial credit (GPC) model.
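For concreteness, a minimal sketch of the 3PL IRF in Equation 2.1b, evaluated for a single hypothetical item (the parameter values are illustrative only):

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """3PL item response function (Equation 2.1b): probability of a correct
    response for an examinee with ability theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty, some guessing.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_3pl(theta, a=1.2, b=0.0, c=0.2), 3))
```

Setting c = 0 gives the 2PL of Equation 2.1a, and additionally constraining a to be equal across items gives the 1PL.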

The GR model (Samejima, 1969) was developed for items with ordered response categories. These categories can represent partial credit achieved on problems with multiple steps, levels of completeness evaluated by human raters, or endorsement levels to attitudinal survey items (Samejima, 1969, 1972). To compute the probability that an examinee will respond in one of K categories for item i, the cumulative category response function must first be calculated:

$$P^*_{ik}(\theta_j) = 1, \quad k = 1,$$
$$P^*_{ik}(\theta_j) = \frac{e^{Da_i(\theta_j - b_{ik})}}{1 + e^{Da_i(\theta_j - b_{ik})}}, \quad k = 2, \ldots, K_i. \qquad (2.2a)$$

Here, $b_{ik}$ is the location of the k-th category boundary for item i, and all other parameters can be interpreted consistently using the definitions previously provided. The cumulative probability for the lowest category is always one, because all examinees are able to score at least a zero. The cumulative response function is interpreted as the probability that an examinee will score in category k or higher. In order to compute the probability that an examinee will respond in a particular category, the difference between cumulative categories is computed as

$$P_{ik}(\theta_j) = P^*_{ik}(\theta_j) - P^*_{i(k+1)}(\theta_j), \quad k = 1, \ldots, K_i - 1,$$
$$P_{ik}(\theta_j) = P^*_{ik}(\theta_j), \quad k = K_i. \qquad (2.2b)$$

The GPC model (Muraki, 1992) is a more general version of Masters' (1982) partial credit (PC) model and can be used with items that have either ordered or unordered categories. Under the GPC model, item discrimination parameters are free to vary, whereas in the PC model they are constrained to be the same. Unlike the GR model, the category response function is computed in one step as

$$P_{ik}(\theta_j) = \frac{\exp\left[\sum_{h=1}^{k} Da_i(\theta_j - b_i + h_{ik})\right]}{\sum_{g=1}^{K_i}\exp\left[\sum_{h=1}^{g} Da_i(\theta_j - b_i + h_{ik})\right]}. \qquad (2.3)$$

Here, $h_{ik}$ is the threshold boundary for category k of item i, g is an index used in the summation of the denominator over the $K_i$ categories, and all remaining parameters have been previously defined. For an item with $K_i$ response categories, only $K_i - 1$ threshold boundaries can be defined; as a result, one $h_{ik}$ is arbitrarily set to zero, typically resulting in $h_{i1} = 0$.
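A minimal sketch of both polytomous models for a single hypothetical four-category item follows; the parameter values are illustrative and are not taken from the exams analyzed in this dissertation.

```python
import numpy as np

D = 1.7

def gr_category_probs(theta, a, b_cat):
    """Graded response model (Equations 2.2a-2.2b): category probabilities as
    differences of cumulative boundary curves. b_cat holds b_i2..b_iK."""
    cum = np.concatenate(([1.0],
                          1.0 / (1.0 + np.exp(-D * a * (theta - np.asarray(b_cat)))),
                          [0.0]))
    return cum[:-1] - cum[1:]

def gpc_category_probs(theta, a, b, h):
    """Generalized partial credit model (Equation 2.3). h holds the category
    thresholds h_i1..h_iK, with h[0] fixed at zero."""
    steps = D * a * (theta - b + np.asarray(h))
    numerators = np.exp(np.cumsum(steps))
    return numerators / numerators.sum()

theta = 0.5
print(np.round(gr_category_probs(theta, a=1.0, b_cat=[-1.0, 0.0, 1.2]), 3))
print(np.round(gpc_category_probs(theta, a=1.0, b=0.1, h=[0.0, 0.8, 0.0, -0.8]), 3))
```

In both functions the four category probabilities sum to one, which is a quick check that the category response functions are computed correctly.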

UIRT Item and Test Characteristics

In IRT, item characteristics can be graphically displayed through an item characteristic curve (ICC). An advantage of the ICC is that it simultaneously portrays the item characteristics in relation to the entire θ scale. An example of an ICC for the 3PL model can be seen in Figure 2-1.

Figure 2-1. Example of ICC for an item calibrated under the 3PL model.

Whereas an ICC is used to graphically display the characteristics of a single item, test characteristic curves (TCCs) are used to display the characteristics of a collection of items or an entire test. Similar to the ICC, TCCs can be used to illustrate how a test functions along the entire θ scale, and they are formed by summing over the ICCs for every item on the test (regardless of number of response categories), conditional on θ. The test characteristic function (TCF), which is the mathematical expression of the TCC, is useful for converting θ estimates to number-correct scores. This is considered an advantage because IRT θ estimates are not usually considered ideal for reporting purposes. Often, the total score metric is preferred, and IRT ability estimates are transformed to total scores through the TCF by summing over the item response functions for all items on the test. The sum of these probabilities indicates the number of items on the test that the examinee is expected to answer correctly, which is the IRT definition of an examinee's true score. The estimated relationship between true scores and their corresponding θs is therefore displayed graphically through the TCC.

The test information function (TIF) is another way to enhance score interpretation by providing an indication of the amount of measurement precision associated with each θ level. An item information function is first computed for each item, conditional on θ, and the TIF is taken as the sum of the item information functions; the TIF is the inverse of the error variance of the ability estimate at each θ. The greater the error variance, the less information is available to estimate an examinee's ability, and hence the more uncertainty there is in score interpretation.
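The sketch below, again using hypothetical 3PL parameters, makes the two summations concrete by computing the TCF (expected number-correct score) and the TIF for a short test; the information formula used is the standard 3PL item information function.

```python
import numpy as np

D = 1.7

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def test_characteristic(theta, a, b, c):
    """TCF: sum of item response functions, i.e., the expected number-correct score."""
    return np.sum(p_3pl(theta, a, b, c))

def test_information(theta, a, b, c):
    """TIF: sum of 3PL item information functions, conditional on theta."""
    p = p_3pl(theta, a, b, c)
    info = (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2
    return np.sum(info)

# Hypothetical 3PL parameters for a 20-item test.
a = np.full(20, 1.0)
b = np.linspace(-2.5, 2.5, 20)
c = np.full(20, 0.2)

for theta in (-1.0, 0.0, 1.0):
    print(theta, round(test_characteristic(theta, a, b, c), 2),
          round(test_information(theta, a, b, c), 2))
```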

An example of a TIF can be seen in Figure 2-2.

Figure 2-2. Example of Test Information Function.

The highest point on the curve indicates where on the θ scale the test is most accurate at estimating ability. Often, target TIFs are used during test construction to develop tests that provide the greatest accuracy near a particular cut score or within a particular score range.

IRT is sometimes referred to as a scaling theory because it renders item and examinee characteristics directly comparable by placing them on the same scale. In order for the item and examinee parameter estimates to be accurate, the assumptions used to define the scale must hold. The assumption most central to the research questions addressed in this dissertation is the unidimensionality assumption, which posits that the IRF is governed by a single latent ability, or can be represented in one-dimensional space. If this assumption is violated and test items truly measure multiple traits but are modeled as measuring just one trait, the scaling of items and examinees can be inaccurate. For example, Walker and Beretvas (2003) found classification accuracy decreased as a result of using a unidimensional IRT model rather than a multidimensional model when evidence suggested the test data were two-dimensional.

In this situation, if two examinees possess similar levels of a primary trait but different levels of the secondary trait, and only the first trait is taken into account, the examinees will appear to be similar when in reality they are different. In order to model these differences with greater accuracy, multiple dimensions need to be considered, as is done in multidimensional IRT (MIRT) methods.

Multidimensional Item Response Theory

MIRT is an extension of UIRT and was developed to more accurately estimate the relationship between items and examinees in situations where items measure more than one latent trait. The need for MIRT stemmed from findings that UIRT models can oversimplify, and thereby misrepresent, the relationship between items and examinees when the latent test space is multidimensional. Over the years, studies have pointed out that cognitive processes are complex and memory retrieval tactics vary widely depending on the task at hand (Carroll, 1993; Frederikson et al., 1990; Healy & McNamara, 1996). It is reasonable to suspect that cognitive processing may take on different forms depending on factors such as item format (e.g., MC vs. FR) or content subdomain (e.g., plant biology, invertebrate biology, etc.). As pointed out by Reckase (2009), the National Assessment of Educational Progress (NAEP) uses multiple-choice, open-ended, and extended-response item types in order to assess the breadth of skills involved in reading. The different item types are thought to measure different constructs or levels of reading skill acquisition. In this example, simultaneously estimating ability using a combination of dichotomous and polytomous UIRT models would ignore how different item types measure different combinations of latent traits, whereas using a MIRT model would avoid introducing measurement error due to oversimplification of the model. Several other studies have also suggested that different item formats tend to measure different constructs (Lee & Brossman, 2012; Li, Lissitz, & Yang, 1999; Tate, 2000; Yao & Boughton, 2009).

The concepts used in MIRT share many similarities with those used in UIRT. For instance, the probability of an examinee correctly responding to a particular item is still expressed as a function of both item and examinee characteristics. However, instead of this relationship being modeled in one dimension, it is modeled in m dimensions. As a result, response probabilities are conditional on multiple latent abilities, which are typically represented by a vector of θs. The general form of the item response function for the MIRT models considered in this dissertation is

$$P(U_i \mid \boldsymbol{\theta}) = f(\boldsymbol{\theta}, \boldsymbol{\gamma}), \qquad (2.4)$$

where Ui is the item response for a particular person, θ is a vector of person ability parameters, and γ is a vector of item parameters (Reckase, 2009). Similar to UIRT models, MIRT models are typically assumed to be represented by monotonically increasing item response functions. The local independence assumption also applies in the MIRT context, and states that item responses for a particular examinee are mutually independent and can be modeled using only θ and γ, for that specific examinee and item, respectively. MIRT models can be classified as either compensatory or noncompensatory depending on whether the m latent traits are specified to have a multiplicative or additive relationship as modeled by the IRF. For noncompensatory models, the relationship is multiplicative, such that a deficiency on any one of the m latent traits holds down the response probability regardless of the levels of the other traits. For compensatory models, a high level of one trait can make up for, or compensate for, deficiencies on another trait because the relationship is additive. The current dissertation focuses only on compensatory MIRT models, wherein a linear combination of θ coordinates is used to predict response probabilities. This decision was made based on software availability and the fact that compensatory MIRT models are more prevalent in the literature.
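The contrast between the two model classes can be made concrete with a small numerical sketch in Python. The compensatory form below follows the generic logistic form described above; the noncompensatory form is written as a simple product of per-dimension logistic terms, and the parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def compensatory_p(theta, a, d):
    """Compensatory MIRT: a linear combination of the traits enters one logistic."""
    return 1.0 / (1.0 + np.exp(-(np.dot(a, theta) + d)))

def noncompensatory_p(theta, a, b):
    """Noncompensatory MIRT: a product of per-dimension logistic terms."""
    terms = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.prod(terms)

a = np.array([1.0, 1.0])
theta_low_high = np.array([-2.0, 2.0])   # deficient on trait 1, strong on trait 2

# With a compensatory model the high theta2 offsets the low theta1 ...
print(compensatory_p(theta_low_high, a, d=0.0))            # about 0.50
# ... but with a noncompensatory model the low theta1 keeps the probability small.
print(noncompensatory_p(theta_low_high, a, b=np.zeros(2)))  # about 0.10
```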

Multidimensional IRT Models

The general exponential form of dichotomous UIRT models is a(θ − b), as can be seen in Equations 2.1a and 2.1b. This exponent can be rewritten in the form aθ + d by distributing a across (θ − b) and replacing −ab with d. This new form, aθ + d, is referred to as the slope/intercept form and is commonly used in MIRT models. The a-parameter is related to the slope of the IRF and is referred to as the discrimination parameter, as in UIRT. In MIRT models there is usually a vector of discrimination values (i.e., a), each of which corresponds to one of the m dimensions of the latent test space. The d-parameter is referred to as the intercept but does not provide an overall indication of item difficulty as with UIRT models. The intercept can be transformed by dividing its negative by one of the m discrimination parameters to get the relative difficulty of the item with respect to that dimension (Reckase, 2009). However, this index of relative difficulty is not easily interpreted and, as a result, overall difficulty is usually computed as

$$b = \frac{-d}{\sqrt{\mathbf{a}\mathbf{a}'}} = \frac{-d}{\sqrt{\sum_{v=1}^{m} a_v^2}}. \qquad (2.5)$$

The b-parameter, which indicates the item's location in the m-dimensional space, is also referred to as MDIFF. The b-parameter is interpreted in the same manner as for UIRT models such that larger values indicate greater overall item difficulty. In order for the parameters in MIRT models to be identified, the origin and scale of each coordinate axis must be fixed and the correlations among the axes must be specified. As mentioned previously, MIRT models can be classified as either compensatory or noncompensatory. Although there is no definitive evidence that suggests one model type is superior to the other, Bolt and Lall (2003) found that compensatory models fit data from an English exam better and resulted in greater accuracy in parameter estimation when compared to noncompensatory models. As mentioned earlier, this dissertation focuses only on compensatory MIRT models because they are more

prevalent in the literature and because software development is more advanced for these models. Therefore, from this point forward it can be assumed that all MIRT models discussed are compensatory in nature. The IRF for the two-parameter logistic (2PL) model can be extended to the multidimensional two-parameter logistic (M2PL; McKinley & Reckase, 1982) model as

$$P_i(U_i = 1 \mid \boldsymbol{\theta}_j, \mathbf{a}_i, d_i) = \frac{e^{\mathbf{a}_i\boldsymbol{\theta}_j' + d_i}}{1 + e^{\mathbf{a}_i\boldsymbol{\theta}_j' + d_i}}. \qquad (2.6)$$

Here, the UIRT θ is replaced with θj, a 1 × m vector of examinee coordinates in m-dimensional space, and a is replaced with ai, a 1 × m vector of discrimination parameters. The parameter di is a scalar and represents the item's intercept. Similar to UIRT models, the M2PL model can be made comparable to the normal ogive metric by multiplying the exponent, aiθj′ + di, in the numerator and denominator by the scaling constant D = 1.7. The multidimensional 3PL (M3PL; Reckase, 1985) model is an extension of the UIRT 3PL model and includes a pseudo-guessing parameter which results in a non-zero lower asymptote. All other terms in the M3PL are interpreted in the same way as in the M2PL. The IRF for the M3PL is expressed as

$$P_i(U_i = 1 \mid \boldsymbol{\theta}_j, \mathbf{a}_i, c_i, d_i) = c_i + (1 - c_i)\,\frac{e^{\mathbf{a}_i\boldsymbol{\theta}_j' + d_i}}{1 + e^{\mathbf{a}_i\boldsymbol{\theta}_j' + d_i}}. \qquad (2.7)$$

MIRT models also exist for items with more than two response categories; these are referred to as polytomous MIRT models and can be thought of as extensions of their UIRT counterparts. The multidimensional extension of the generalized partial credit (MGPC) model was developed by Yao and Schwarz (2006), but the parameterization provided here is adopted from Reckase (2009). For the MGPC model, the maximum score for item i is Ki, the minimum score is zero, and there are Ki + 1 score categories in total. The possible scores attainable by person j on item i are denoted k = 0, 1, ..., Ki. The IRF for the MGPC model is expressed as

$$P_i(U_i = k \mid \boldsymbol{\theta}_j, \mathbf{a}_i, \boldsymbol{\beta}_i) = \frac{\exp\!\left(k\,\mathbf{a}_i\boldsymbol{\theta}_j' - \sum_{x=0}^{k}\beta_{ix}\right)}{\sum_{v=0}^{K_i}\exp\!\left(v\,\mathbf{a}_i\boldsymbol{\theta}_j' - \sum_{x=0}^{v}\beta_{ix}\right)}. \qquad (2.8)$$

Here, βik represents the threshold parameter for category k, βi0 is defined as zero, and all other terms have been previously defined. Different from the UIRT version of the GPC model, the MGPC model does not make a distinction between the item's overall location and the location of each category boundary. Instead, there is a single threshold parameter for each category boundary. The expected score for examinee j, on item i, is computed as

$$E(x_i \mid \boldsymbol{\theta}_j) = \sum_{k=0}^{K_i} k\,P_i(U_i = k \mid \boldsymbol{\theta}_j). \qquad (2.9)$$

Muraki and Carlson (1993) developed a multidimensional extension of the graded response (MGR) model that is expressed in the normal ogive metric. Similar to the UIRT GR model, the MGR model assumes ordered response categories and that successful completion of step k requires that step k − 1 be successfully completed first. The minimum and maximum scores for item i are zero and Ki, respectively. The probability of scoring at or above each of the k steps is essentially modeled by the M2PL model with responses below k scored as 0 and those equal to or greater than k scored as 1. Then, the probability of scoring in a particular category, k, is found by subtracting the cumulative probability of scoring at step k + 1 from the cumulative probability of scoring at step k, as given by

$$P_i(U_i = k \mid \boldsymbol{\theta}_j) = P_i^{*}(U_i \geq k \mid \boldsymbol{\theta}_j) - P_i^{*}(U_i \geq k + 1 \mid \boldsymbol{\theta}_j). \qquad (2.10)$$

The cumulative category response probability for Ui ≥ 0 is always 1, and for Ui ≥ Ki + 1 is always 0. The mathematical form of the normal ogive MGR model is given as

$$P_i(U_i = k \mid \boldsymbol{\theta}_j) = \frac{1}{\sqrt{2\pi}}\int_{\mathbf{a}_i\boldsymbol{\theta}_j' + d_{i,k+1}}^{\mathbf{a}_i\boldsymbol{\theta}_j' + d_{ik}} e^{-t^2/2}\,dt. \qquad (2.11)$$

Here, k represents the examinee's score on item i and can take the values 0, 1, ..., Ki; ai is a vector of discrimination parameters; and dik is a parameter representing the

easiness of reaching the kth step of the item. Contrary to usual conventions, larger positive values of dik indicate the step is easier to achieve and larger negative values indicate the step is more difficult. The dik parameter for a score of zero is set to +∞ and for a score of Ki + 1 is set to −∞. The Bifactor (Gibbons & Hedeker, 1992) MIRT models differ from the aforementioned MIRT models in that items are allowed to load on one general dimension and one group-specific dimension. The full MIRT models make no distinction concerning the hierarchy of the traits being measured and do not impose the constraint that an item can have only two non-zero discrimination parameters. Although the Bifactor model is not a hierarchical model in the traditional sense, it can be estimated as one, seeing that the general dimension is common to all items whereas the group-specific dimensions are common only to a set of secondary item clusters. There has been increasing interest in the use of the Bifactor model due to its computational simplicity in estimation and its straightforward interpretability. High-dimensional Bifactor models have a computational demand equivalent to traditional two-dimensional MIRT models because, regardless of the number of factors in the model, the computations are limited to a series of two-dimensional integrals. An example of a factor pattern for a three-item test under the Bifactor model is

$$\begin{bmatrix} a_{10} & a_{11} & 0 \\ a_{20} & a_{21} & 0 \\ a_{30} & 0 & a_{32} \end{bmatrix}.$$

It can be seen that all items load on the general or first dimension, while the first two items have an additional loading on the first group-specific dimension and the third item on the second group-specific dimension. In this example, the general dimension could represent the subject area of science while the specific dimensions represent item format such that the first two are MC items and the third is an FR item. This modeling approach allows for the residual variance due to item formats to be taken into account. Specific

dimensions could also represent content, cognitive, or process domains and would depend on the design and purpose of the test. Under the Bifactor MIRT model, the assumption of conditional independence is made with respect to all latent dimensions, general and group-specific. The Bifactor model allows for test items with different numbers of response categories to be modeled using different IRT models, such that both dichotomous and polytomous models can be calibrated simultaneously. The probability of a correct response given an examinee's general (θ0) and specific (θS) latent abilities under the Bifactor extension of the UIRT 3PL model was provided by Cai, Yang, and Hansen (2011) as

$$P_i(U_i = 1 \mid \theta_0, \theta_S) = c_i + \frac{1 - c_i}{1 + e^{-D[d_i + a_{i0}\theta_0 + a_{iS}\theta_S]}}. \qquad (2.12)$$

Here, di represents the item intercept, ai0 represents the slope of the item on the general dimension, aiS is the item slope for the group-specific dimension, and D is a scaling constant. Cai et al. (2011) also presented a logistic version of the Bifactor extension of the GR model that is similar to the MGR model provided by Muraki and Carlson (1993). For an item with K response categories, let Ui ∈ {0, 1, ..., K − 1}, so that the cumulative response probability is

$$P_i^{*}(U_i \geq k \mid \theta_0, \theta_S) = \frac{1}{1 + e^{-D[d_{i,k} + a_{i0}\theta_0 + a_{iS}\theta_S]}}. \qquad (2.13)$$

There is a total of K − 1 cumulative response functions for an item with K response categories. The notation used to describe the parameters in the model is the same as previously defined. As with the UIRT and MIRT GR models, the probability of responding in a particular category under the Bifactor GR model is found by taking the difference between two adjacent cumulative response functions.
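As an illustration of Equation 2.12, the following is a minimal Python sketch of the Bifactor 3PL response probability. The item parameters are hypothetical, and the group-specific dimension is meant to stand in for an item-format factor of the kind described above.

```python
import math

def bifactor_3pl_p(theta_general, theta_specific, a_general, a_specific, d, c, D=1.7):
    """Bifactor 3PL probability (Equation 2.12): one general and one specific slope."""
    z = D * (d + a_general * theta_general + a_specific * theta_specific)
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# Hypothetical MC item: strong general loading, modest format-specific loading
p = bifactor_3pl_p(theta_general=0.5, theta_specific=-0.3,
                   a_general=1.1, a_specific=0.4, d=-0.2, c=0.2)
print(round(p, 3))
```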

A Bifactor extension of the unidimensional GPC model was provided by Cai et al. (2011) with notation modified from Thissen, Cai, and Bock (2010). The probability of responding in category k = 0, 1, ..., K − 1 is given as

$$P_i(U_i = k \mid \theta_0, \theta_S) = \frac{e^{W_k[a_{i0}\theta_0 + a_{iS}\theta_S] + d_{ik}}}{\sum_{g=0}^{K-1} e^{W_g[a_{i0}\theta_0 + a_{iS}\theta_S] + d_{ig}}}, \qquad (2.14)$$

where Wk represents the scoring function of the kth category and g is a subscript used for summation in the denominator. The intercept (d) parameters in the Bifactor extension of both the GR and GPC models are interpreted as locations. Furthermore, if an item loads only on the general dimension and has zero loadings on all specific dimensions, the aforementioned Bifactor models reduce back to their UIRT counterparts. For model identification purposes, it is usually assumed that the latent traits (both general and specific) are mutually orthogonal. However, this assumption can be relaxed so that orthogonality holds between the group-specific factors, conditional on the general factor (Rijmen, 2009). For mixed-format tests, the conditional category response probability for examinee j's response to item i under the Bifactor model is Pi(Ui = k | θ0, θS) for k = 0, ..., Ki − 1 and takes on a multinomial distribution. Cai et al. (2011) provide the conditional density function for an item as

$$f(x_{ij} \mid \theta_0, \boldsymbol{\theta}_S) = \prod_{k=0}^{K_i - 1} P_i(U_i = k \mid \theta_0, \theta_S)^{\chi_k(x_{ij})}. \qquad (2.15)$$

The indicator function, χk(xij), takes on the value 1 when Ui = k and 0 otherwise, thereby pulling out the category response probabilities corresponding to examinee j's observed responses.

MIRT Item Characteristics

Similar to items calibrated under unidimensional models, items calibrated under MIRT models can be characterized according to an overall description of their item

difficulty and discrimination. A single index, MDISC, is used to represent the discriminating power of an MIRT item and corresponds to the point on the item response surface (IRS) in the θ-space with the steepest slope. Similarly, MDIFF is an indicator of the overall difficulty of an MIRT item and represents the distance from the origin in the θ-space to the point that is below the point of steepest slope. Using the M2PL model as an example, the multidimensional discrimination index for item i is

$$\mathrm{MDISC}_i = \sqrt{\sum_{k=1}^{m} a_{ik}^2}. \qquad (2.16)$$

The item discrimination index can be computed for both dichotomous and polytomous (Muraki & Carlson, 1993) MIRT models. For item i, the multidimensional difficulty, MDIFFi, is computed as

$$\mathrm{MDIFF}_i = \frac{-d_i}{\mathrm{MDISC}_i}. \qquad (2.17)$$

The multidimensional difficulty of an item is interpreted with respect to the direction corresponding to the steepest slope in the latent space. Similar to UIRT models, large positive values of MDIFF indicate greater item difficulty, while large negative values indicate less difficulty. While Equation 2.17 is applicable to the dichotomous MIRT models presented previously, the equivalent can be found for the category boundary parameters of polytomous items (Muraki & Carlson, 1993) using

$$\mathrm{MDIFF}_{ik} = \frac{-d_{ik}}{\mathrm{MDISC}_i}. \qquad (2.18)$$

Here, MDIFFik represents the difficulty of the kth step of the graded response item and dik is the corresponding category boundary parameter. Several approaches can be used to graphically display the characteristics of MIRT models and many are extensions from the unidimensional context. For UIRT, an examinee's item response probability is related to their latent ability through the ICC,

whereas for MIRT models response probabilities are related to more than one latent ability through the item response surface (IRS). The IRS is sometimes referred to as the surface plot and can be seen in Figure 2-3 (Reckase, 2009) for a dichotomous item calibrated under the M2PL model.

Figure 2-3. Example of IRS for the M2PL Model.

From the IRS, the point in the ability space at which the item discriminates best can be determined by finding the point on the surface with the steepest slope. Under the MIRT framework, an item can discriminate among many different composites of the m latent abilities assessed by the item. However, an item has its greatest discrimination in just one direction within the multidimensional space, but at the same time has a range of composite levels at which discrimination is adequate enough to provide information for ability estimation (Ackerman, 1996). For a model that incorporates a pseudo-guessing

parameter (i.e., the M3PL), the lower end of the IRS in Figure 2-3 would correspond to ci rather than zero. For polytomous items, the surface plot has as many surfaces as there are response categories. An example of a category response surface (CRS) plot for an item calibrated under the MGR model can be seen in Figure 2-4 (Reckase, 2009).

Figure 2-4. Category Response Surfaces for an Item Calibrated with the MGR Model.

The darkest surface corresponds to the probability of a score of zero and each subsequent surface becomes lighter as the scoring function increases. A similar pattern will emerge for a CRS under the MGPC model. In the same way that ICCs can be combined to create a TCC in UIRT, the CRSs can be combined to form a test response surface for MIRT models. In MIRT, contour plots are sometimes used to provide information above and beyond what the IRS plot provides. An example of a contour plot can be seen in Figure 2-5 (Reckase, 2009).
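To connect the surface plot in Figure 2-3 with Equations 2.16 and 2.17, the following Python sketch evaluates the M2PL response surface over a grid of θ coordinates and computes MDISC and MDIFF for the same item. The discrimination vector and intercept are hypothetical values used only for illustration.

```python
import numpy as np

def m2pl_probability(theta1, theta2, a, d):
    """M2PL response probability (Equation 2.6) for a two-dimensional item."""
    z = a[0] * theta1 + a[1] * theta2 + d
    return 1.0 / (1.0 + np.exp(-z))

a_i = np.array([1.2, 0.5])   # hypothetical discrimination vector
d_i = -0.6                   # hypothetical intercept

# Evaluate the item response surface over a theta grid (as plotted in Figure 2-3)
theta1, theta2 = np.meshgrid(np.linspace(-3, 3, 61), np.linspace(-3, 3, 61))
surface = m2pl_probability(theta1, theta2, a_i, d_i)

# Overall item characteristics (Equations 2.16 and 2.17)
mdisc = np.sqrt(np.sum(a_i ** 2))
mdiff = -d_i / mdisc
print(surface.shape, round(mdisc, 3), round(mdiff, 3))
```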

Figure 2-5. Example of a Contour Plot for an Item Response Surface.

The contour lines are equiprobability lines meaning that examinees with θ estimates on the same line have equal probabilities of answering the item correctly. The solid arrow in Figure 2-5 is on the line which corresponds to a 0.3 probability of a correct response. Reckase (2009) points out that since the response probability is constant in this direction, the slope in this direction is effectively zero. In compensatory MIRT models, the direction perpendicular to the contour lines, indicated by the dashed arrow in Figure 2-5, corresponds to the composite (which is a linear combination of m traits) that is best measured by the item. Points in the contour plot where the lines are the closest indicate the area in which the item is most discriminating and thereby provides the greatest information. The directional vector plot is another graphical tool used in MIRT analyses and has the advantage of being able to represent more than one item at a time. Vector plots use the vector notation developed by Reckase and McKinley (1991) to represent the

characteristics of an item. An example of a vector plot from a study by Ackerman (1996) can be seen in Figure 2-6.

Figure 2-6. Vector Representation of Two-Dimensional Items.

The discriminating power of an item is marked by the length of the vector, with longer vectors indicating greater discrimination. The angular direction of the vector from the primary axis (i.e., θ1) represents the direction in the multidimensional space that the item measures best. The difficulty of the item is denoted by the location of the item's origin in the space. Easier items will be located in the third quadrant (i.e., Item 1) and more difficult items in the first quadrant (i.e., Items 2 & 3). MIRT models can also be evaluated at the test level using concepts similar to the unidimensional TCC and TIF, which will be discussed next.

MIRT Test Characteristics

Test characteristic surfaces (TCS) are used with MIRT models to describe the regression of the expected number-correct scores on the θ-coordinates. The TCS is found by summing over the IRS for all the items in the test, regardless of item format. The

mathematical expression for the TCS is the same as for the TCF, except the expectation is conditional on a θ-vector rather than a single θ. Another way to convey the characteristics of a multidimensional test is through the use of a reference composite. The reference composite indicates the orientation of a unidimensional θ-scale in the multidimensional space, if a unidimensional model were to be fit to the data. Wang (1985, 1986) demonstrated that the unidimensional projection which makes up the reference composite can be found as the eigenvector of the a'a matrix that corresponds to the largest eigenvalue. Here, a represents the matrix of all item discrimination estimates for the test. Angles are used to describe the orientation of the reference composite in regard to the m coordinate axes and can be computed by taking the arccosine of the eigenvector elements. For a two-dimensional solution, a reference composite that measures both dimensions equally well will be represented by a line with approximately equal angles with the θ1 and θ2 coordinate axes. Zhang and Stout (1999a) presented a similar, yet different, way to impose a UIRT scale in multidimensional test space. Specifically, they proposed a direction of best measurement, as a linear composite of θ values for which the test provided the greatest amount of information.

MIRT Item and Test Information

Item information in multidimensional space is dependent upon the direction of movement from one θ-centroid to another. For example, starting from a particular θ-centroid in the multidimensional space, information values will differ depending on whether movement is made toward the first versus the third quadrant. The general form of the information function for a given item, specific to direction α, is

$$I_\alpha(\boldsymbol{\theta}) = \frac{[\nabla_\alpha P_i(\boldsymbol{\theta})]^2}{P_i(\boldsymbol{\theta})\,Q_i(\boldsymbol{\theta})}, \qquad (2.19)$$

where α represents a vector of angles with the coordinate axes that give the direction from the θ-centroid; ∇α is the directional derivative in the direction of α; Pi(θ) is the

probability of a correct response; and Qi(θ) equals 1 − Pi(θ). The directional derivative, ∇αPi(θ), is computed as

$$\nabla_\alpha P_i(\boldsymbol{\theta}) = \frac{\partial P_i(\boldsymbol{\theta})}{\partial \theta_1}\cos\alpha_1 + \frac{\partial P_i(\boldsymbol{\theta})}{\partial \theta_2}\cos\alpha_2 + \cdots + \frac{\partial P_i(\boldsymbol{\theta})}{\partial \theta_m}\cos\alpha_m. \qquad (2.20)$$

Equation 2.20 can be generalized to the family of compensatory MIRT models, but the partial derivatives are taken with respect to the model-specific parameters. Maximum item information will be achieved in the direction of the steepest slope along the 0.5 probability contour line. To compute MIRT test information, the information for each item on the test is summed over, conditional on θ.

Dimensionality Assessment Procedures

In order to determine whether a UIRT or MIRT model is most appropriate to use with the data, it is necessary to conduct a series of dimensionality assessments. However, even the most thorough evaluation will not necessarily provide a correct answer to the dimensionality question because different approaches yield different conclusions. An important point made by Reckase (2009) is that dimensionality is not a property of the test, but depends on both the sensitivity of the test items to the latent traits they purport to measure and the variability in the examinee sample on those traits. For example, if a collection of test items is sensitive to three traits, but the examinee sample varies on just one of the traits, then the examinee locations will lie on a plane in the three-dimensional space. In this case, the use of one dimension would likely be sufficient. The same logic applies to the situation where the sample of examinees is variable with respect to three traits, but the items are only sensitive to one. A similar scenario can arise even when test items and examinees are variable with respect to multiple dimensions, where all of the items measure a similar combination of traits. In this case, a unidimensional model may prove adequate as well. Therefore, dimensionality is best described as a characteristic of the item response matrix and not the test because it is sample dependent (Reckase, 2009).

After having reviewed the dimensionality assessment literature, Reise et al. (2000) suggested that it is better to overestimate the dimensionality of a data structure than to underestimate it. Similarly, it has been found that ability estimates can be oversimplified as a result of projecting from a higher to a lower dimensional space (Reckase, 2009). On the other hand, as dimensionality increases so does the number of parameters that need to be estimated, which in turn increases the chance for estimation errors. Regardless, dimensionality assessment is a critical precursor to accurate (M)IRT equating. Various exploratory and confirmatory procedures have been developed to help answer the dimensionality question. One approach is to first see whether unidimensionality holds and, in the case that it does not, analyses proceed by determining the number of dimensions required to adequately model the item-examinee interaction. One of the more commonly used procedures is DIMTEST (Stout et al., 1999; Stout et al., 2001). DIMTEST is a nonparametric procedure, meaning that no model is fit to the data, whereas parametric procedures fit a model before analyses are carried out. DIMTEST is appropriate for dichotomous items only and tests whether there is one or more than one dominant dimension. This is done by separating the test items into two subtests: the Assessment Test (AT) and the Partitioning Test (PT). The PT items are chosen as those that measure best along the dominant trait while the AT items are chosen so that they measure best in the direction most opposite to that of the PT items. A statistical test is used to test the null hypothesis of zero conditional covariances, where retention of the null hypothesis is taken as support that a UIRT model can adequately represent the data. When evidence suggests unidimensionality has been violated, the next step is typically to determine how many dimensions are needed to adequately represent the data. A nonparametric procedure commonly used for this purpose is DETECT (Zhang & Stout, 1999b), which stands for dimensionality evaluation to enumerate contributing traits.
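The conditional covariances that DIMTEST (and, in the next paragraph, DETECT) build on can be illustrated with a small Python sketch. This is not the DIMTEST or DETECT procedure itself, only the basic statistic: the covariance between a pair of items after conditioning on a rest score computed from the remaining items. The simulated responses are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 0/1 response matrix: 2000 examinees by 10 items, simulated at random,
# so the conditional covariances here should hover around zero
responses = rng.integers(0, 2, size=(2000, 10))

def conditional_covariance(responses, i, j):
    """Average covariance of items i and j within groups matched on the rest score."""
    rest = responses.sum(axis=1) - responses[:, i] - responses[:, j]
    covs, weights = [], []
    for score in np.unique(rest):
        group = responses[rest == score]
        if len(group) > 1:
            covs.append(np.cov(group[:, i], group[:, j])[0, 1])
            weights.append(len(group))
    return np.average(covs, weights=weights)

print(conditional_covariance(responses, 0, 1))
```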

DETECT is used to determine the number of dimensions needed to represent the data under the assumption of simple structure by using estimated pairwise conditional covariances. DETECT operates by looking for item clusters that are homogeneous and measure in a best direction that is distinct from the direction of best measurement for the entire test. Several indices are provided to help the researcher evaluate the degree of dimensionality. The DETECT index typically ranges between 0 and 5, where values of 1 or greater indicate large multidimensionality; values of 0.4 to 1 indicate moderate to large multidimensionality; values below 0.4 indicate moderate to weak multidimensionality; and values below 0.2 indicate unidimensionality (Jang & Roussos, 2007). Another index, referred to as the IDN index, gives the proportion of estimated covariances, computed from the partitioned subtests, that have signs in the expected direction. IDN values close to 1 indicate multidimensionality under the simple structure assumption. Lastly, a ratio is provided to evaluate whether the solutions obtained are stable or may have occurred by chance, where values near 1 indicate stability. Poly-DIMTEST (Li & Stout, 1995) is an extension of DIMTEST that is equipped to handle both dichotomous and polytomous items with various score points. Poly-DIMTEST shares the theoretical foundation employed by DIMTEST and provides a single index indicating whether the test items are essentially unidimensional. Poly-DETECT (Yu & Nandakumar, 1997) is a modified version of DETECT that can incorporate dichotomous and polytomous items and also operates under the assumption of simple structure. Another method for assessing dimensionality is by means of a disattenuated correlation between sets of item clusters. The clusters can be formed in a confirmatory manner according to features such as item format, content area, or cognitive processes, to name a few. The disattenuated correlation represents the correlation between true scores by taking into account unreliability of the cluster scores. Disattenuated correlations typically range between 0 and 1 (although values greater than 1 are sometimes found), where values

close to 1 indicate that the two item sets measure the same construct, thereby implying unidimensionality. Performing a principal components analysis (PCA) on the inter-item correlation matrix is another way to roughly estimate dimensionality. Tetrachoric correlations are used to estimate the relationship between pairs of dichotomous items and polychoric correlations are used for polytomous items. The polychoric correlation is the Pearson correlation between two normally distributed continuous latent variables that have been made ordinal, whereas the tetrachoric correlation is a special case of the polychoric correlation in which both variables have been dichotomized. The tetrachoric and polychoric correlations are used instead of the traditional Pearson correlation because use of the latter may result in spurious dimensions due to item difficulty, which could overestimate the number of actual latent dimensions (Hambleton & Rovinelli, 1986; McDonald & Ahlawat, 1974). The threat of overestimation is less of a concern with tetrachoric and polychoric correlations. Criteria used to judge whether unidimensionality holds include (a) looking for an elbow at the second eigenvalue in the scree plot, (b) comparing the ratio of the first eigenvalue to the second with the ratio of the second eigenvalue to the remaining eigenvalues (Lord, 1980), and (c) seeing if the first eigenvalue accounts for at least 20% of the total variance (Reckase, 1979). Another procedure to determine the number of dimensions was proposed by Schilling and Bock (2005) in the form of a chi-square (χ²) difference test among nested models with m and m + 1 dimensions. According to Reckase (2009), the rule of thumb for interpreting this statistic should be adopted from the field of structural equation modeling, such that good fit results when the χ² divided by its degrees of freedom is less than three. A study by Tate (2003) researched the performance of this statistic and found it worked well under most conditions. Dimensionality assessments come in many varieties and, depending on the researcher's background or evaluative criteria, the conclusions drawn may differ. That is

why it is important to consider not only statistical evidence but also substantive evidence in answering the dimensionality question. Lastly, it is worth emphasizing that in this dissertation, dimensionality refers to the number of coordinate axes in the latent space and not the number of constructs assessed by the test. The number of constructs relates to the number of reference composites represented by meaningful item clusters, which can be different from the number of dimensions (Reckase, 2009).

UIRT Scale Linking Procedures

After the dimensionality has been determined, and following the calibration of items and examinees, there is one more step involved before equating can be conducted: scale linking. Linking procedures for UIRT models will be presented next and those for MIRT models will be presented in the following section. Prior to conducting any form of IRT equating, Form X (New Form) and Form Y (Old Form) scores must be placed on the same scale through some form of a scale linking procedure. Oftentimes, parameter estimates from different IRT calibrations are on different, but linearly related, scales. In such a case, the two θ scales for X and Y are related by

$$\theta_X = A\theta_Y + B, \qquad (2.21)$$

where A and B are constants that can be estimated using various methods. The relationships between item parameters for scales X and Y under the UIRT 3PL model are

$$a_{Xi} = \frac{a_{Yi}}{A}, \qquad (2.22a)$$

$$b_{Xi} = Ab_{Yi} + B, \qquad (2.22b)$$

$$c_{Xi} = c_{Yi}. \qquad (2.22c)$$

As can be seen, the pseudo-guessing parameter remains unchanged after scale transformation. Once item and examinee parameters have been transformed, response

probabilities on the new scale will be equivalent to response probabilities on the old scale. This is because of the indeterminacy of IRT scale locations and units of measurement. Even though this example uses a dichotomous IRT model, the same logic applies to polytomous IRT models, which will be discussed in more detail later. Scale linking procedures are sometimes unnecessary depending on the data collection design. Under the RG equating design, if the scaling is the same (i.e., μ = 0, σ = 1) during calibration for both forms, then parameter estimates are assumed to be on the same scale and do not require transformation. For the SG design, the same is true as with the RG design: whether separate or concurrent calibration is used, scale transformation is not required. For the CINEG design, separate calibration is usually performed and estimates from the common items are then used to link the two scales. In the cases where concurrent calibration is used with the CINEG design, the resulting parameter estimates for both forms are on the same scale, rendering scale linking unnecessary. In situations where scale linking is required, the procedures involved differ depending on the psychometric model and data collection design.

Scale Linking Methods for Dichotomous IRT Models

With the CINEG design, there are two scale transformation methods for dichotomous items that involve using moments of the item parameters. The first is referred to as the mean/sigma method (Marco, 1977) and uses the means and standard deviations of the b-parameters to compute the scaling constants, A and B, as

$$A = \frac{\sigma(b_X)}{\sigma(b_Y)} \quad \text{and} \quad B = \mu(b_X) - A\mu(b_Y). \qquad (2.23)$$

The second is referred to as the mean/mean method (Loyd & Hoover, 1980) and uses the means of the a- and b-parameters to compute the scaling constants, A and B, as

$$A = \frac{\mu(a_Y)}{\mu(a_X)} \quad \text{and} \quad B = \mu(b_X) - A\mu(b_Y). \qquad (2.24)$$
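A minimal Python sketch of Equations 2.23 and 2.24 follows, using hypothetical common-item parameter estimates; the values below are invented for illustration and are not taken from the exams analyzed in this dissertation.

```python
import numpy as np

# Hypothetical common-item estimates from separate calibrations of Forms X and Y
b_x = np.array([-0.40, 0.10, 0.75, 1.20]); a_x = np.array([1.10, 0.85, 1.30, 0.95])
b_y = np.array([-0.55, 0.00, 0.60, 1.05]); a_y = np.array([1.25, 0.90, 1.45, 1.05])

# Mean/sigma method (Equation 2.23)
A_ms = b_x.std() / b_y.std()
B_ms = b_x.mean() - A_ms * b_y.mean()

# Mean/mean method (Equation 2.24)
A_mm = a_y.mean() / a_x.mean()
B_mm = b_x.mean() - A_mm * b_y.mean()

print("mean/sigma:", round(A_ms, 3), round(B_ms, 3))
print("mean/mean: ", round(A_mm, 3), round(B_mm, 3))
```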

It is important to note that the moments used in the mean/sigma and mean/mean methods are from the common items only. Furthermore, even though the previous equations are defined in terms of parameters, estimates are typically used instead. The scale transformation methods presented so far consider item parameters separately, whereas those presented next consider them simultaneously through the IRT characteristic curve. The Haebara (1980) characteristic curve method specifies the difference between ICCs from Forms X and Y as

$$\mathit{Hdiff}(\theta_j) = \sum_{i:V}\left[P_i(\theta_{Xj}; a_{Xi}, b_{Xi}, c_{Xi}) - P_i\!\left(\theta_{Xj}; \frac{a_{Yi}}{A}, Ab_{Yi} + B, c_{Yi}\right)\right]^2. \qquad (2.25)$$

That is, the squared difference in ICCs is computed conditionally on each particular θj for all common items (i: V) and then summed over items. The Hdiff statistic is computed over a specified quadrature distribution of abilities, and the scaling constants A and B are found so that the following criterion is minimized:

$$\mathit{Hcrit} = \sum_j \mathit{Hdiff}(\theta_j). \qquad (2.26)$$

The summation in Equation 2.26 is across the specified ability quadrature distribution. In contrast to the Haebara method, the Stocking and Lord (1983) characteristic curve method takes the squared difference of the sums over items, such that

$$\mathit{SLdiff}(\theta_j) = \left[\sum_{i:V} P_i(\theta_{Xj}; a_{Xi}, b_{Xi}, c_{Xi}) - \sum_{i:V} P_i\!\left(\theta_{Xj}; \frac{a_{Yi}}{A}, Ab_{Yi} + B, c_{Yi}\right)\right]^2. \qquad (2.27)$$

That is, the difference is taken between the test characteristic curves, conditional on each θj, and then squared. The SLdiff statistic is computed across a quadrature distribution of

abilities and scaling constants A and B are estimated so that the following criterion is minimized:

$$\mathit{SLcrit} = \sum_j \mathit{SLdiff}(\theta_j). \qquad (2.28)$$

Similar to Hcrit, the summation for SLcrit is taken over the ability quadrature distribution. For the Haebara and Stocking and Lord methods, there are several options for choosing the quadrature points and weights in the computation of Hcrit and SLcrit (Kolen & Brennan, 2004). The characteristic curve methods more accurately depict the relationship between item parameters and the probability of a correct response by an examinee because the probability is dependent on the collective interplay of all the item parameters. Up to this point, the moment and characteristic curve methods for performing scale transformations have been applicable to dichotomous item types only. Similar procedures exist for cases when polytomously scored items are in need of transformation and are presented next.

Scale Linking Methods for Polytomous IRT Models

Scale transformation methods for polytomous items can be viewed as extensions of those previously presented for dichotomous items. Furthermore, the conditions that rendered scale transformation unnecessary for dichotomous items under the RG, SG, and CINEG designs are equally applicable to polytomous items. The mean/mean and mean/sigma methods for polytomous items were presented by Cohen and Kim (1998) for the GR model. For the mean/sigma method, the mean and standard deviation of the b-parameters for the common items are calculated separately for Forms X and Y. Then, they are substituted into Equation 2.23 to compute the slope and intercept of the transformation equation. For the mean/mean method, the mean of the a-parameters is found along with the mean of the b-parameters, which are computed in the

same way as in the mean/sigma method. The means are then substituted into Equation 2.24 to obtain the slope and intercept of the transformation equation. A similar scale linking procedure exists for the mean/mean and mean/sigma methods under Muraki's (1992) generalized partial credit (GPC) model (Kolen & Brennan, 2004). For the mean/sigma method, the mean and standard deviation of the term bi − dih are computed for each item category boundary for the common items. The mean and standard deviations are computed separately for Forms X and Y. Then, the mean and standard deviations can be substituted into Equation 2.23 to find the intercept and slope, respectively, of the transformation equation. For the mean/mean method, the mean of the a-parameters and the mean of the term bi − dih are computed and substituted into Equation 2.24 to find the scaling constants. Then, θ, a-, and b-parameter estimates are transformed by substituting the scaling constants into Equations 2.21, 2.22a, and 2.22b, respectively. The d-parameters are transformed by multiplying them by the slope, A, of the transformation equation. The characteristic curve transformation methods for polytomous items are similar to those for dichotomous items, except that the minimization criteria are taken over the category boundaries for all items. According to Kolen and Brennan (2004), the Haebara difference equation for the GR model can be computed as

$$\mathit{Hdiff}(\theta_j) = \sum_{i:V}\sum_{k:i}\left[P_{ik}(\theta_{Xj}; a_{Xi}, b_{Xi2}, \ldots, b_{Xik}, \ldots, b_{XiK_i}) - P_{ik}\!\left(\theta_{Xj}; \frac{a_{Yi}}{A}, Ab_{Yi2} + B, \ldots, Ab_{Yik} + B, \ldots, Ab_{YiK_i} + B\right)\right]^2. \qquad (2.29)$$

Here, the first summation is taken over common items and the second over categories within items. Similar to the dichotomous version, Hcrit is found by substituting Equation 2.29 into Equation 2.26 and solving for the scaling coefficients that minimize the function.
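To illustrate how the characteristic curve criteria are minimized in practice, the following Python sketch estimates A and B by minimizing a Stocking–Lord-type criterion for a small set of hypothetical dichotomous (3PL) common items. It is a simplified illustration of the logic in Equations 2.27 and 2.28 with equal quadrature weights, not a replication of any operational linking software.

```python
import numpy as np
from scipy.optimize import minimize

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probabilities for a grid of thetas (rows) and a set of items (columns)."""
    z = D * a[None, :] * (theta[:, None] - b[None, :])
    return c[None, :] + (1.0 - c[None, :]) / (1.0 + np.exp(-z))

# Hypothetical common-item estimates on the Form X and Form Y scales
a_x = np.array([1.1, 0.9, 1.3]); b_x = np.array([-0.4, 0.2, 0.9]); c_x = np.array([0.20, 0.15, 0.25])
a_y = np.array([1.2, 1.0, 1.4]); b_y = np.array([-0.6, 0.0, 0.7]); c_y = c_x.copy()

theta_q = np.linspace(-4, 4, 41)          # quadrature points, equally weighted here

def sl_criterion(params):
    A, B = params
    tcc_x = p_3pl(theta_q, a_x, b_x, c_x).sum(axis=1)
    tcc_y_transformed = p_3pl(theta_q, a_y / A, A * b_y + B, c_y).sum(axis=1)
    return np.sum((tcc_x - tcc_y_transformed) ** 2)   # Equations 2.27-2.28

result = minimize(sl_criterion, x0=[1.0, 0.0], method="Nelder-Mead")
print("A, B =", result.x)
```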

The Stocking-Lord method for the GR model computes the difference as

$$\mathit{SLdiff}(\theta_j) = \left[\sum_{i:V}\sum_{k:i} W_{ik}P_{ik}(\theta_{Xj}; a_{Xi}, b_{Xi2}, \ldots, b_{Xik}, \ldots, b_{XiK_i}) - \sum_{i:V}\sum_{k:i} W_{ik}P_{ik}\!\left(\theta_{Xj}; \frac{a_{Yi}}{A}, Ab_{Yi2} + B, \ldots, Ab_{Yik} + B, \ldots, Ab_{YiK_i} + B\right)\right]^2. \qquad (2.30)$$

Here, Wik represents the scoring function and takes on the score associated with category k. The difference for the Stocking-Lord method is computed between test characteristic curves. SLcrit is found by substituting Equation 2.30 into Equation 2.28. For the GPC model, Hdiff and SLdiff are computed by first reparameterizing the GPC parameters to Bock's nominal model. This is possible because the GPC is a special case of the nominal model. The category response function for Bock's nominal model can be expressed as

$$P_{ik}(\theta_j; a_{i1}, a_{i2}, \ldots, a_{iK_i}, c_{i1}, c_{i2}, \ldots, c_{iK_i}) = \frac{e^{a_{ik}\theta_j + c_{ik}}}{\sum_{h=1}^{K_i} e^{a_{ih}\theta_j + c_{ih}}}. \qquad (2.31)$$

Here, aik and cik represent the slope and intercept parameters of the kth category for item i, respectively. Equations 2.32a and 2.32b can be used to express the GPC model in terms of the nominal model as

$$a_{ik} = Dka_i, \qquad (2.32a)$$

and

$$c_{ik} = -Dka_ib_i + Da_i\sum_{h=1}^{k} d_{ih}. \qquad (2.32b)$$

The parameters on the left side of the equations correspond to the nominal model and those on the right correspond to the GPC model. Next, Hdiff for Bock's nominal model can be computed as

$$\mathit{Hdiff}(\theta_j) = \sum_{i:V}\sum_{k:i}\left[P_{ik}(\theta_{Xj}; a_{Xi1}, \ldots, a_{Xik}, \ldots, a_{XiK_i}, c_{Xi1}, \ldots, c_{Xik}, \ldots, c_{XiK_i}) - P_{ik}\!\left(\theta_{Xj}; \frac{a_{Yi1}}{A}, \ldots, \frac{a_{Yik}}{A}, \ldots, \frac{a_{YiK_i}}{A}, c_{Yi1} - \frac{B}{A}a_{Yi1}, \ldots, c_{Yik} - \frac{B}{A}a_{Yik}, \ldots, c_{YiK_i} - \frac{B}{A}a_{YiK_i}\right)\right]^2. \qquad (2.33)$$

Hcrit can be found by substituting Equation 2.33 into Equation 2.26, summing over the ability quadrature distribution, and minimizing the function. A polytomous version of the Stocking and Lord characteristic curve method for the GPC or nominal model was presented by Baker (1993) for ordered item response categories. Using notation from Bock's nominal model, SLdiff can be computed as

$$\mathit{SLdiff}(\theta_j) = \left[\sum_{i:V}\sum_{k:i} W_{ik}P_{ik}(\theta_{Xj}; a_{Xi1}, \ldots, a_{Xik}, \ldots, a_{XiK_i}, c_{Xi1}, \ldots, c_{Xik}, \ldots, c_{XiK_i}) - \sum_{i:V}\sum_{k:i} W_{ik}P_{ik}\!\left(\theta_{Xj}; \frac{a_{Yi1}}{A}, \ldots, \frac{a_{Yik}}{A}, \ldots, \frac{a_{YiK_i}}{A}, c_{Yi1} - \frac{B}{A}a_{Yi1}, \ldots, c_{Yik} - \frac{B}{A}a_{Yik}, \ldots, c_{YiK_i} - \frac{B}{A}a_{YiK_i}\right)\right]^2. \qquad (2.34)$$

SLcrit can be found by substituting Equation 2.34 into Equation 2.28 and summing over the ability quadrature distribution. Again, this method is only appropriate for response categories that are ordered. The scale linking methods presented thus far have been developed for tests where the unidimensionality assumption has not been severely violated. If these methods were applied to multidimensional data the linking results would not be valid. A new set of scale linking procedures are required for linking multidimensional mixed-format tests and will be presented next.
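Before turning to the multidimensional case, the reparameterization in Equations 2.32a and 2.32b (as reconstructed above) can be sketched in Python. The GPC item parameters below are hypothetical; the function simply returns the nominal-model slopes and intercepts implied by those values.

```python
import numpy as np

def gpc_to_nominal(a, b, d, D=1.7):
    """Convert GPC parameters (a, b, and category parameters d_1..d_K) to
    nominal-model slopes a_k = D*k*a and intercepts
    c_k = -D*k*a*b + D*a*sum(d_1..d_k), for k = 1..K (Equations 2.32a, 2.32b)."""
    k = np.arange(1, len(d) + 1)
    slopes = D * k * a
    intercepts = -D * k * a * b + D * a * np.cumsum(d)
    return slopes, intercepts

# Hypothetical GPC item with three score categories above zero
slopes, intercepts = gpc_to_nominal(a=0.9, b=0.3, d=np.array([0.8, 0.1, -0.9]))
print(slopes, intercepts)
```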

MIRT Scale Linking Procedures

As is the case for UIRT equating, multidimensional equating requires that two forms be placed on the same scale prior to equating. Several multidimensional scale linking procedures have been presented in recent years (Davey et al., 1996; Li & Lissitz, 2000; Min, 2003; Oshima et al., 2000). The procedures for multidimensional scaling are more complex and involved due to the need to transform as many scales as there are coordinate axes. Rather than requiring two scaling constants, A and B, to account for differences in origin and unit of measurement as in UIRT, scalars and matrices are required to account for differences in translation, dilation, correlation, and rotation. Translation and dilation in MIRT are similar to the origin and unit of measurement in UIRT, respectively. The next section provides background information on the indeterminacies of compensatory MIRT models.

Translation of the Origin

In MIRT, a source of indeterminacy is the origin of the θ-space, and through translation the invariance property of IRT can be preserved. During translation, examinee locations remain fixed while the coordinate axes locations change. The vector of coordinates for person j in the new coordinate system (υj) can be expressed as

$$\boldsymbol{\upsilon}_j = \boldsymbol{\theta}_j - \boldsymbol{\delta}, \qquad (2.35)$$

where δ is a vector of locations of the new coordinates using the old coordinate system. The person locations in the old coordinate system can be converted to locations in the new coordinate system as

$$\boldsymbol{\upsilon} = \boldsymbol{\theta} - \mathbf{1}\boldsymbol{\delta}. \qquad (2.36)$$

Here, 1 represents a vector of 1s. After translation of the origin, the means of the θ-coordinates change, but their standard deviations and inter-correlations remain the same. In order to maintain the invariance property, item parameters must also be converted to

66 44 the coordinate system with the new origin. Using the M2PL model as an example, the d- parameter vector (d) is converted to the new coordinate system as d = d + aδ. (2.37) Here, d represents the d-parameter vector in the new coordinate system. It should be noted that the a-parameters are left unchanged. The exponential expressions in the M2PL model for the old and new coordinate systems are aθ' + 1d and aυ' + 1d, respectively and give equivalent response probabilities (Reckase, 2009). Unit of Measurement of the Coordinate Axes Another source of indeterminacy in MIRT is associated with the unit of measurement for each of the m coordinate axes. This concept is similar to the UIRT solution but instead of dealing with one unit of measurement there are m units which correspond to the m coordinate axes. Similar to the UIRT case, the unit of measurement for any given dimension can be established by setting the standard deviation of the location coordinates to a certain value (i.e., 1). However, when two groups have different coordinate location variances, the standard deviation set by the calibration software will not have the same meaning. For example, if Group A has a standard deviation of 0.9 on the first dimension and Group B has a standard deviation of 0.6, then a person in Group A with a θ1 coordinate of 0.5 would have a coordinate of 0.75 on Group B s coordinate system. This is because the units for Group B were smaller, but the values on the coordinate axes are larger. In order to get the examinee s θ1 from Group A on the Group B coordinate system, it needs to be multiplied by 0.9/0.6 = 1.5. This process is referred to as dilation of the coordinate points because they are getting larger, whereas the term contraction is used to describe situations where the points become smaller. In order to change the person coordinates from one set of units to those of another set of units, the θ-matrix is multiplied by a matrix of scaling constants (C),

$$\mathbf{C} = \begin{bmatrix} \sigma_{11}/\sigma_{12} & 0 \\ 0 & \sigma_{21}/\sigma_{22} \end{bmatrix},$$

where the first subscript indexes the dimension and the second indexes the group. The scaling matrix above is for a two-dimensional solution. The upper left-hand entry represents the ratio of the standard deviation on dimension 1 for the first group to that of the second group. The transformed person coordinates (υ) can be found by postmultiplying θ by C. Item parameters are transformed to the new unit of measurement by postmultiplying the a-matrix by the inverse of the scaling constant matrix, aC⁻¹. After the person and item parameters have been transformed, the response probabilities will be equivalent to those before dilation or contraction. Transforming the m units of measurement will change the multidimensional discrimination (MDISC), difficulty (MDIFF), and coordinate angle values, and results in the same outcome as if a nonorthogonal rotation of the axes was performed (Reckase, 2009).

Rotation of the Coordinate Axes

Orientation of the coordinate axes is an indeterminacy of MIRT models and can be addressed through rotation methods. In order to transform the coordinates of the original θ-space to the coordinates of a new θ-space, the original coordinate matrix must be multiplied by the orthogonal rotation matrix (Rot), which is given as

$$\mathbf{Rot} = \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}$$

for a two-dimensional solution, where α represents the degree of rotation in the clockwise direction. The matrix of abilities (θ) and matrix of item discriminations (a) are postmultiplied by the rotation matrix, whereas the d-parameter remains unaltered. It is important to note that the inverse of this orthogonal rotation matrix is its transpose, so that Rot′Rot = RotRot′ = I, the Identity matrix. The rotation matrix adjusts for the mean, variances, and orientation of the m coordinate axes, but does not change the distance of the item and person coordinates from the origin. With dilation, the mean and variances of the coordinate axes are adjusted, but the correlation between them remains unchanged.
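The three transformations just described can be sketched together in Python. The person coordinates, translation vector, scaling matrix, and rotation angle below are all hypothetical; the point is only to show how translation (Equations 2.35–2.37), dilation, and an orthogonal rotation are applied to a matrix of θ coordinates, and how the item discriminations and intercepts are adjusted so that response probabilities are preserved.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=(5, 2))          # hypothetical person coordinates (5 examinees, m = 2)
a = np.array([[1.1, 0.3],
              [0.5, 0.9]])               # hypothetical item discriminations (2 items x 2 dims)
d = np.array([-0.2, 0.4])                # hypothetical item intercepts

# Translation of the origin: person coordinates shift, intercepts absorb the shift
delta = np.array([0.5, -0.3])
upsilon = theta - delta
d_star = d + a @ delta

# Dilation: postmultiply person coordinates by C and discriminations by C inverse
C = np.diag([0.9 / 0.6, 1.2 / 1.0])      # hypothetical ratios of group standard deviations
upsilon = upsilon @ C
a_star = a @ np.linalg.inv(C)

# Orthogonal rotation: postmultiply person coordinates and discriminations by Rot
alpha = np.radians(20.0)
Rot = np.array([[np.cos(alpha), -np.sin(alpha)],
                [np.sin(alpha),  np.cos(alpha)]])
upsilon = upsilon @ Rot
a_star = a_star @ Rot

# The M2PL exponents, and hence the response probabilities, are unchanged
z_before = theta @ a.T + d
z_after = upsilon @ a_star.T + d_star
print(np.allclose(z_before, z_after))    # True
```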

When rotating in a higher dimensional space, the process can be simplified by separating the rotation into a series of rotations around each of the orthogonal coordinate axes (Reckase, 2009). The full rotation is then taken as the product of each of the separate rotations. Orthogonal rotation of the coordinate axes will change the mean and standard deviation of the θ-vectors as well as their inter-correlations. A common procedure for determining the rotation angle is the Procrustes method, which specifies a target matrix and finds the rotation that minimizes the squared distance between the elements in the target matrix and those in the rotated matrix. A more thorough explanation of the Procrustes method can be found in Gower and Dijksterhuis (2004). A complete scale transformation usually requires a rotation of the axes, dilation of the scale units, and translation of the origin. The sequence in which each of these is carried out will affect the properties of the transformed coordinate system. There are many combinations that can be used to transform the item and examinee parameter estimates, none of which is necessarily better than the others. Therefore, decisions on how to best transform the multidimensional latent test space will depend on practical considerations of the testing program, such as test purpose and use.

Random Groups Design

Under the random groups (RG) design, it is assumed that the New and Old Form groups have equal origins and units of measurement in the m-dimensional test space because they are randomly equivalent. This would typically dismiss the need to adjust for translation and dilation in MIRT scale linking under the RG design. However, if evidence suggests large group differences then it may be wise to transform the coordinate systems so that the origins and units of measurement are equivalent (Reckase, 2009). When group differences are not large, the orientation of the axes still might not be equivalent between groups, which would call for a rotation of the axes. In addition to rotational indeterminacy, item and ability parameter estimates are also subject to correlational

indeterminacy (Thompson et al., 1997). The correlational indeterminacy problem is often solved by specifying that the latent abilities follow a multivariate normal distribution and are mutually orthogonal to one another. A study by Brossman and Lee (2013) found that MIRT scale linking under the RG design was not required in order to conduct observed score full MIRT equating. This held for situations where the ability density distribution was specified as multivariate normal, the solution was orthogonal, and the variance-covariance matrix was the Identity matrix. Specifically, they found that the conditional distribution, f(x | θ), and ability density distribution, ψ(θ), were the same before and after rotation, such that the equating results were identical. There are no known similar conditions that exist for the nonequivalent groups design, and therefore scale linking is always necessary prior to equating. This dissertation focuses on equating under the RG design, thereby eliminating the need for scale linking.

Nonequivalent Groups Design

Under the CINEG design, efforts are made to match statistical and content specifications between the common items and the entire test so that the common items function as a mini version of the full-length test. Any group differences in the common item parameter estimates can be attributable to the calibration settings used in the estimation program to set the scale. As pointed out by Reckase (2009), the item parameter estimates are expected to be the same across groups if the same coordinate system is used because the items are identical. In practice, however, the estimates are usually not identical and a transformation of the coordinate axes is done to make them as similar as possible across forms. Hirsh (1989) presented a method of scale linking for MC-only tests with common examinees for a two-dimensional solution. His method required the estimation of four rotation matrices, two dilation parameters, and two location parameters. After scale transformation, means and standard deviations of the ability estimates for the common

examinees on the Old and New Forms were used to compute traditional scaling constants. The constants were then used to transform the examinee and item parameter estimates on the New Form so that the means and standard deviations were similar to the Old Form. Although this work was seminal in the development of MIRT linking procedures, its use was limited and several other procedures have been developed since then. Three procedures for linking item parameter estimates from two forms calibrated under the M3PL model with the CINEG design were presented in a simulation study by Davey, Oshima, and Lee (1996). The first procedure was referred to as matching scaling functions and was an extension of the mean/sigma method used with UIRT models. The second procedure was an extension of Stocking and Lord's (1983) test characteristic curve method and was referred to as matching test response functions or surfaces. The third procedure was an extension of Divgi's (1985) procedure for UIRT models that minimized the squared difference between common item parameter estimates across forms. This would later be referred to as the Direct method (Davey, Oshima, & Lee, 2000). The third procedure is comparable to a Procrustes rotation (Hurley & Cattell, 1962), where a rotation matrix, Rot, is found that minimizes a least squared difference function between two matrices of factor loadings. A few years later, Davey et al. (2000) presented four separate MIRT linking methods, three of which were carried over from Davey et al. (1996). The fourth, new method was referred to as the Item Characteristic Function method and is a multidimensional extension of the Haebara (1980) method. All four methods involve the simultaneous estimation of a rotation matrix and a translation vector. The rotation matrix accounts for both rotational and dilational indeterminacies. Similar to conclusions drawn from UIRT linking methods, the characteristic curve methods produced the most stable linking coefficients. Li and Lissitz (2000) evaluated the accuracy of several different MIRT linking procedures for dichotomous items under the M2PL model. Specifically, they used

Procrustes rotation to adjust the coordinate axes and then compared three different methods for estimating two translation parameters and a single dilation parameter. The translation parameters were estimated by using a matched test response surface (MTRS) method and a least squares method. The dilation parameter was estimated using the MTRS method, a ratio of the eigenvalues of the discrimination matrices, and a least squares method involving the ratio of the traces of the centered discrimination matrices, which was distinct from the least squares method used to estimate the translation parameters. They concluded that the ratio of the traces was the best method for estimating the dilation parameter and the least squares method was the most accurate for estimating the translation parameters. The performance of MIRT scale linking methods appears to be dependent on two things: (a) how well the underlying dimensions can be extracted from the New and Old Forms, and (b) how accurately those dimensions can be matched across forms. Since data dimensionality is a product of the interaction between examinee and item characteristics, matching dimensional structures across test forms and groups can be quite challenging. However, Li and Lissitz (2000) demonstrated that θ estimates (but not expected number-correct scores) can be successfully recovered by linking on just one matched dimension when dimensionality is very different across forms. This is an added challenge of using the CINEG design with MIRT models and more research is needed in this area. Since the focus of this dissertation is on MIRT equating using the RG design, linking was not necessary.

UIRT Equating Methods

IRT equating can be conducted using either a true score or observed score method, and both generally utilize number-correct scores. This is because even though IRT may be used during test development, scores are often reported in the number-correct metric because of complications associated with the θ scale. For example, pattern scoring can result in examinees answering the same number of items correctly, but

receiving different θ estimates, which is hard to explain to test consumers. Therefore, IRT true score and observed score equating methods typically use number-correct scores. The process of IRT true score equating involves equating true scores through use of the TCC. Specifically, the IRT true score on Form X that is associated with a given θj is considered equivalent to the true score on Form Y associated with that same θj. For the 3PL UIRT model, the number-correct true score on Form X (the New Form) associated with a given θj is defined as

$$\tau_X(\theta_j) = \sum_{i:X} P_i(\theta_j; a_i, b_i, c_i), \qquad (2.38)$$

where the summation is taken over all items on Form X. A similar process is used to relate IRT true scores to θj on Form Y. When the 3PL model is used, true scores below the sum of the c-parameters do not exist and the relationship between true scores and θj only holds for true scores in the range

$$\sum_{i:X} c_i < \tau_X < n_X \quad \text{and} \quad \sum_{i:Y} c_i < \tau_Y < n_Y. \qquad (2.39)$$

Here, nX and nY represent the total number of items on Forms X and Y, respectively. When estimated true scores fall outside the range specified in Equation 2.39, ad hoc procedures developed by Lord (1980) and Kolen (1981) can be used. True score equating involves three general steps: (1) specify a true score τX on Form X, (2) identify the latent trait score (θj) that corresponds to that true score, and (3) find the true score on Form Y, τY, that corresponds to that θj (Kolen & Brennan, 2004). The second step requires the use of an iterative numerical procedure called the Newton-Raphson method. The Newton-Raphson method solves for θj by setting the function func(θj) = τX − Σi:X Pi(θj; ai, bi, ci) to zero and then using the first derivative of this function, func′(θj) = −Σi:X Pi′(θj; ai, bi, ci), where

IRT true score equating is designed to equate true scores; however, true scores are never known, and as a result observed number-correct scores are used in their place even though no theory supports this use (Kolen & Brennan, 2004).

The first step in IRT observed score equating is to estimate number-correct score distributions on both forms using an IRT model, or a combination of models in the case of mixed-format tests (Kolen & Brennan, 2004). Conditional distributions of observed number-correct scores for a given θ_j are modeled using the compound binomial and compound multinomial distributions for dichotomous and polytomous models, respectively. Traditional equipercentile methods are then used to equate the two model-based estimated observed score distributions. For dichotomously scored items, and under the assumption of local independence, the number-correct score distribution for each θ_j is estimated using the Lord and Wingersky (1984) recursion formula. First, f_r(x|θ_j) represents the number-correct score distribution over the first r items for an examinee with ability θ_j. The probability of earning a score of 0 on the first item is f_1(x = 0|θ_j) = (1 − P_j1), and the probability of earning a 1 is f_1(x = 1|θ_j) = P_j1. For a test with r > 1, the recursion formula is given as

f_r(x|\theta_j) = f_{r-1}(x|\theta_j)(1 - P_{jr}),                                   x = 0,
               = f_{r-1}(x|\theta_j)(1 - P_{jr}) + f_{r-1}(x - 1|\theta_j)P_{jr},     0 < x < r,     (2.41)
               = f_{r-1}(x - 1|\theta_j)P_{jr},                                       x = r.
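The recursion in Equation 2.41 is straightforward to implement. The short Python sketch below is offered only as an illustration of the computation; the function name and array conventions are not taken from any of the software programs used in this dissertation.

    import numpy as np

    def lord_wingersky(p):
        """Conditional number-correct distribution for dichotomous items (Equation 2.41).
        p: probabilities P_j1, ..., P_jr of a correct response at a fixed theta_j.
        Returns f(x | theta_j) for x = 0, ..., r as an array of length r + 1."""
        f = np.array([1.0 - p[0], p[0]])        # distribution after the first item
        for p_r in p[1:]:
            new = np.zeros(len(f) + 1)
            new[:-1] += f * (1.0 - p_r)         # item r answered incorrectly
            new[1:] += f * p_r                  # item r answered correctly
            f = new
        return f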

Using the recursion formula, the number-correct score distribution for each θ_j is estimated using all r items. A similar recursion formula was presented by Hanson (1994) and Thissen et al. (1995) for polytomously scored items, and will be presented in Chapter 3. Once the conditional number-correct score distribution is estimated, the marginal number-correct score distribution can be estimated. If the ability distribution is assumed to be continuous, the marginal distribution is given by

f(x) = \int_{\theta} f(x|\theta) g(\theta) \, d\theta,     (2.42)

where g(θ) represents the θ distribution. If a discrete θ distribution with equally spaced quadrature points is specified, the integral in Equation 2.42 can be approximated by

f(x) = \sum_{j} f(x|\theta_j) g(\theta_j).     (2.43)

Furthermore, when the ability distribution is discrete with a finite number of N abilities, the marginal number-correct score distribution can be computed as

f(x) = \frac{1}{N} \sum_{j} f(x|\theta_j).     (2.44)

Equation 2.44 would be used with a Monte Carlo method (i.e., drawing a large number of simulees from a specified population) when there is a large number of latent dimensions. Once the marginal observed score distributions for Forms X and Y have been estimated, they are equated using traditional equipercentile methods.
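Putting Equations 2.41 and 2.43 together, a UIRT marginal number-correct distribution can be sketched in a few lines of Python. The item parameters and the number of quadrature points below are hypothetical and serve only to show the flow of the computation, and the lord_wingersky() helper is the one sketched above.

    import numpy as np

    # Hypothetical 3PL item parameters for a very short form (illustration only).
    a = np.array([1.0, 0.8, 1.2, 0.9])
    b = np.array([-0.5, 0.0, 0.5, 1.0])
    c = np.array([0.2, 0.2, 0.2, 0.2])

    # Discrete quadrature approximation to g(theta): equally spaced points with
    # standard normal weights rescaled to sum to one (Equation 2.43).
    theta_q = np.linspace(-4, 4, 25)
    w_q = np.exp(-0.5 * theta_q ** 2)
    w_q /= w_q.sum()

    f_x = np.zeros(len(a) + 1)
    for theta, w in zip(theta_q, w_q):
        p = c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))
        f_x += w * lord_wingersky(p)

    print(f_x.sum())  # numerically 1.0; f_x is the marginal number-correct distribution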

Use of the IRT true and observed score equating methods just presented requires that the IRT assumption of unidimensionality be met. Past studies have suggested that IRT equating results are robust to violations of this assumption as long as the violations are not severe (Bolt, 1999; Camilli et al., 1995; Cook et al., 1985; De Champlain, 1996; Dorans & Kingston, 1985). However, in cases where test items are believed to measure multiple abilities, attempts need to be made to model the interaction of test items and examinees as accurately as possible. It is likely that MIRT models provide a better fit than UIRT models in some instances (Li, Li, & Wang, 2010), and as a result, MIRT scale linking and equating procedures are becoming more prevalent. The next section provides background information concerning the advancements that have been made in the area of MIRT equating.

MIRT Equating Methods

The need for MIRT equating procedures is growing rapidly as more and more evidence suggests that the test-taking process is more complex than previously perceived. In order for equating to be done with the highest possible level of accuracy, the psychometric model needs to fit the data well. MIRT methods have the potential to improve the accuracy of equating as a direct result of more accurately modeling the multi-faceted test-taking process. Slowly, researchers have begun to disentangle the challenges that accompany more complex psychometric models and have started offering solutions to the MIRT equating problem.

A study by Lee and Brossman (2012) presented observed score equating procedures under a simple-structure (SS) MIRT framework for mixed-format tests and compared the results with those obtained using UIRT and traditional equipercentile methods. An advantage of the SS-MIRT procedure, when used in conjunction with the RG design, is that it does not require rotational indeterminacy to be accounted for prior to equating, even when the factors are correlated. In this study, dimensionality was associated with item format such that θ_1 and θ_2 represented MC and FR items, respectively. The steps involved were much like UIRT observed score equating, with the exceptions that (a) MC and FR item types were calibrated separately, (b) a bivariate latent ability distribution was specified, and (c) the conditional observed score distribution for each item type had to be estimated separately and then combined across item types.

Specifically, the conditional total score distribution, f(x|θ_1, θ_2), was computed as

f(x|\theta_1, \theta_2) = P(X = x|\theta_1, \theta_2) = \sum_{x = w_1 x_1 + w_2 x_2} f_1(x_1|\theta_1) f_2(x_2|\theta_2),     (2.45)

where the summation is over all possible pairs of weighted part scores that produce a specific total score, x. The conditional observed score distributions for each item type, f_1(x_1|θ_1) and f_2(x_2|θ_2), are computed using the Lord and Wingersky (1984) recursive algorithm or an extension of it in the case of polytomous items (Hanson, 1994). Replacing integration with summation and using discrete quadrature points, the marginal observed score distribution was found by combining conditional total-score distributions over a bivariate ability distribution as

f(x) = \sum_{\theta_1} \sum_{\theta_2} f(x|\theta_1, \theta_2) q(\theta_1, \theta_2),     (2.46)

where q(θ_1, θ_2) is the density function of the bivariate normal distribution for a specific combination of θ_1 and θ_2. Results from real data analyses indicated that, between the UIRT and SS-MIRT procedures, the latter produced results most similar to the traditional equipercentile method and led to less total equating error when used with multidimensional data.

The first full MIRT equating procedure was illustrated by Brossman and Lee (2013), but for MC-only tests. Specifically, they provided examples of three equating methods: (1) full MIRT observed score equating, (2) unidimensional approximation to MIRT observed score equating, and (3) unidimensional approximation to MIRT true score equating. Both unidimensional approximation methods were extensions of earlier work by Zhang and colleagues (Zhang, 1996; Zhang & Stout, 1999a; Zhang & Wang, 1998). In general, they found that the multidimensional procedures performed more similarly to traditional equipercentile methods, while the UIRT method resulted in more systematic error. Interestingly, both unidimensional approximation methods produced equating results that were very similar to the full MIRT method.
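The combination over weighted part scores in Equation 2.45 amounts to a convolution of the two section-score distributions on the weighted composite scale. The following Python sketch shows one way to carry it out; the integer section weights and the way the part-score distributions are indexed are assumptions made for the illustration.

    import numpy as np

    def combine_weighted_parts(f1, f2, w1, w2):
        """Conditional total-score distribution over x = w1*x1 + w2*x2 (Equation 2.45).
        f1, f2: conditional part-score distributions f1(x1|theta1) and f2(x2|theta2),
        indexed by the unweighted part scores 0, 1, 2, ...; w1, w2: integer weights."""
        max_total = w1 * (len(f1) - 1) + w2 * (len(f2) - 1)
        f_total = np.zeros(max_total + 1)
        for x1, p1 in enumerate(f1):
            for x2, p2 in enumerate(f2):
                f_total[w1 * x1 + w2 * x2] += p1 * p2
        return f_total

With unit weights (w1 = w2 = 1), this reduces to an ordinary convolution of the two conditional distributions.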

Recently, a study by Lee and Lee (2014) presented an MIRT observed score equating procedure for mixed-format tests using the Bifactor model. Again, multidimensionality was associated with item format. The 2PL and GR Bifactor models were fit to a series of mixed-format tests representing different content areas using three types of datasets: (1) matched samples from intact groups, (2) pseudo-forms from a single group, and (3) a simulation study. An orthogonal trivariate distribution was specified for the general and group-specific traits, and conditional observed score distributions for each combination of those traits were estimated using extended versions of the Lord and Wingersky (1984) and Hanson (1994) recursive formulas for dichotomous and polytomous items, respectively. Marginal observed score distributions were computed across the trivariate distribution as

f(x) = \int \int \int f(x|\theta_G, \theta_M, \theta_F) g(\theta_G, \theta_M, \theta_F) \, d\theta_G \, d\theta_M \, d\theta_F.     (2.47a)

In the case of a discrete ability distribution, the integration in Equation 2.47a can be approximated with summation using a set of quadrature points and weights, so that f(x) is computed as

f(x) = \sum_{\theta_G} \sum_{\theta_M} \sum_{\theta_F} f(x|\theta_G, \theta_M, \theta_F) q(\theta_G, \theta_M, \theta_F),     (2.47b)

where q(θ_G, θ_M, θ_F) is the density function of the trivariate normal distribution for the general, MC-specific, and FR-specific latent traits. Equating was carried out using traditional equipercentile methods after the New and Old Form marginal distributions were obtained. In general, they concluded that when the Bifactor model fit the data better, equating results favored the Bifactor method over the UIRT method.

After reviewing the literature, there is a recognizable need for further research on equating multidimensional tests. While there have been a handful of MIRT equating studies, none have presented solutions for conducting full MIRT equating with mixed-format tests. The added complexity of incorporating non-simple-structure MIRT models and MIRT models for polytomous items is likely to be a topic of interest for many testing programs.

The purpose of this dissertation was to address these complexities and provide illustrations of full MIRT observed-score equating procedures for mixed-format tests.

CHAPTER III
METHODOLOGY

Chapter Three provides background information on the datasets used in this dissertation, as well as the original data collection method, test blueprints, and scoring rules. Also included in this chapter are details pertaining to data manipulations, dimensionality assessment procedures, equating procedures, and evaluation criteria.

Data and Procedures

The data used in this dissertation were from the College Board's Advanced Placement (AP) Exams administered in 2011, and were originally collected under the CINEG design. AP Exams are used to award students college credit for AP courses taken while still in high school. The decision to award college credit is made by postsecondary institutions and is based on students' AP Exam grades, which are on a scale from 1 to 5. Although the College Board offers AP Exams in 34 subject areas, only 3 were used in this dissertation. Specifically, the main and alternate forms for Chemistry, Spanish Language, and English Language and Composition were chosen so that different content areas would be represented. Each exam is mixed-format and contains MC items, all with five response options, and FR items with various numbers of response categories.

Even though AP Exams were used in this dissertation, adjustments were made that changed the psychometric properties of the exams in such a way that results cannot be generalized to the operational AP Exams. Specifics related to these modifications will be described in more detail later. Furthermore, the conversion tables used in this dissertation were devised for research purposes only, and do not reflect the operational conversion tables. As a result, the focus of this dissertation is on the equating methods and not the characteristics of the specific AP Exams used herein.

The Chemistry main and alternate forms were given to 116,330 and 3,592 examinees, respectively, and each consisted of 75 MC items and 6 FR items.

The Spanish Language main and alternate forms were given to a total of 118,176 and 3,243 examinees, respectively, and each contained 70 MC items and 4 FR items. For Spanish Language, 34 of the MC items were listening tasks and 36 were reading comprehension tasks. The English main and alternate forms were administered to 404,970 and 5,141 examinees, respectively, and each consisted of 54 MC items and 3 FR items. The possible score points and operational weights for the Chemistry, Spanish Language, and English items on the main and alternate forms can be found in Tables A1 and A2, respectively. Even though the alternate and main forms shared a set of common items, they were not treated as such for the purpose of this dissertation. Instead, the common items were treated as operational items in order to maintain test length and adhere to the operational test specifications.

Operationally, weights are applied to examinees' MC and FR section scores to obtain composite raw scores. These composite raw scores are then transformed to composite scale scores that range from 0 to 70, and to AP Grades that range from 1 to 5. The scale scores were developed operationally to be normally distributed with a mean of 35 and a standard deviation of 10 for the main form. Once the alternate form composite raw scores are equated to the main form scale, equated composite scores are transformed to scale scores using the raw-to-scale conversions in Tables A3 to A5. The AP Grade scale was also developed for the main form using operational cut scores that were established using standard setting procedures. Similarly, equated composite raw scores from the alternate form are converted to AP Grades using the conversions in Tables A3 to A5.

In the past, AP Exams have been scored such that guessing was penalized, which is also known as formula scoring. The 2011 administration of the AP Exams marked the first year that number-correct scoring was used operationally, such that missing and incorrect responses received a score of zero. This scoring protocol was carried through into this dissertation; however, other aspects were modified. For example, instead of using the operational non-integer weights, integer weights, rounded to the nearest whole number, were used.

This ensured that composite scores would be integers as well, which eliminates the complications of using non-integer scores in equating procedures. However, this sometimes resulted in the main and alternate forms having different maximum weighted composite scores. As a result, the integer weights had to be adjusted, or in some instances an FR item was removed. Therefore, the weights and maximum score points used in this dissertation do not match those used operationally. The comparison of weights and items used operationally and in this dissertation can be found in Tables A1 and A2.

Operationally, equating is conducted on composite raw scores such that weighted composite scores from the alternate form (treated as the New Form) are equated to the main form scale (treated as the Old Form). Following this pattern, the alternate form will be referred to as the New Form, and the main form will be referred to as the Old Form. Thus, equating is done on composite raw scores, and scale scores are included to facilitate score interpretation.

Matched Samples Datasets

Given that the AP Exams were collected under the CINEG design and this dissertation focuses on the RG design, some data manipulation was required in order to form randomly equivalent groups. One approach is to create pseudo-forms that are half-length versions of an operational form. Oftentimes, pseudo-forms are used to create an artificial SG design because the act of splitting a form into halves results in the same set of examinees having taken both pseudo-forms A and B. In this context, it can be assumed that the group who took pseudo-form A is randomly equivalent to the group who took pseudo-form B, because they are the same set of examinees. The downside to using pseudo-forms is that full-length forms need to be divided into shorter forms, which not only disregards operational test development efforts, but also poses specific challenges to detecting multidimensionality in the data. For example, the multidimensionality contained within the full-length operational form may not be as pronounced in the half-length pseudo-forms. Given that the focus of this dissertation is on MIRT equating, this was considered a severe limitation.

Another challenge is that many of the AP Exams contain item clusters in the MC section, which makes the task of creating pseudo-forms with equivalent content and statistical specifications more burdensome. For these reasons, pseudo-forms were not utilized in this dissertation.

An alternative to using pseudo-forms is to sample a subset of examinees from the main and alternate form groups using a selection variable that is related to the construct of interest, which in this dissertation is achievement (as measured by total scores). Sampling is done with the intent to create two matched subsamples that are similar enough in ability to be considered randomly equivalent. There are several selection variables to choose among in the AP datasets, such as gender, ethnicity, year in school, reduced exam fee, geographic location, and parental education level. A study by Powers and Kolen (2012) found support for the use of one or several different matching variables over no matching variable when creating randomly equivalent groups using AP data. However, the choice of which matching variable(s) to use is important, as it has been shown that some are more appropriate than others. For example, several studies have shown that matching on common-item scores can lead to biased equating results (Eignor, Stocking, & Cook, 1990; Livingston, Dorans, & Wright, 1990), and for that reason common-item scores were not used in this dissertation. In order for the matched samples to be approximately equivalent in ability levels (as measured by common-item scores), there must be a moderate to strong relationship between the matching variable and achievement level. Past studies with AP Exams have used ethnicity and parental education (Hagge & Kolen, 2011), reduced exam fee (Powers et al., 2011), parental education (Powers & Kolen, 2011), and gender, reduced exam fee, and parental education (Lee & Lee, 2013) as matching variables. In each of these studies, the chosen sampling variable(s) were found to be adequate in producing matched samples; however, the AP subject areas varied from study to study. Thus, for this dissertation, a series of single matching variables and combinations of matching variables were tested to find which could best approximate randomly equivalent groups.

In order to evaluate the extent to which randomly equivalent groups were successfully created, measures of effect size for common item scores were computed between the alternate and main form subgroups. The formula used to compute the effect size for group differences was a modification of Cohen's d statistic that uses a pooled standard deviation. Specifically, the effect size was computed as

ES = \frac{\bar{x}_M - \bar{x}_A}{\sqrt{\dfrac{(N_M - 1)\sigma_M^2 + (N_A - 1)\sigma_A^2}{N_M + N_A}}},     (3.1)

where \bar{x}_M is the mean of the common item scores on the main form, \sigma_M^2 is the variance of scores on the main form, and N_M is the size of the subsample for the main form. The same notation is used for the alternate form, which is denoted by subscript A. According to Kolen and Brennan (2004), for the CINEG design, effect sizes of 0.1 or less are preferable, those equal to or greater than 0.3 can result in differences between equating methods, and those larger than 0.5 are particularly problematic. Since this dissertation focused on the RG design, the maximum allowable effect size was set at ±0.05, because groups were assumed to be randomly equivalent, whereas this assumption is not made under the CINEG design.
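A minimal Python sketch of Equation 3.1 is given below; it assumes only that the common-item scores for the two subsamples are available as numeric arrays, and the function name is illustrative.

    import numpy as np

    def effect_size(scores_main, scores_alt):
        """Standardized mean difference of Equation 3.1 (pooled-SD variant of Cohen's d)."""
        x_m = np.asarray(scores_main, dtype=float)
        x_a = np.asarray(scores_alt, dtype=float)
        n_m, n_a = len(x_m), len(x_a)
        pooled_var = ((n_m - 1) * x_m.var(ddof=1) + (n_a - 1) * x_a.var(ddof=1)) / (n_m + n_a)
        return (x_m.mean() - x_a.mean()) / np.sqrt(pooled_var)

Note that the denominator of the pooled variance follows Equation 3.1 as written (N_M + N_A), rather than the N_M + N_A − 2 used in some other pooled-variance formulas.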

One limitation of using matched samples is that a large amount of data can be discarded if distributions of the matching variable(s) are not similar in the two parent samples. Similarly, it may be difficult to produce pseudo-groups that are close enough in achievement levels (as measured by common-item scores) to be considered randomly equivalent. A benefit of using matched samples as opposed to pseudo-forms is that the multidimensionality of the data is not diluted by using half-length forms. Furthermore, the operational dimensionality of the Old and New Forms is unlikely to be affected if large enough subsamples can be extracted.

Since real data were used in this dissertation, the population equating relationship could not be known. Therefore, an estimate of the population equating relationship was used to obtain target equivalents, which were used to evaluate the equating methods being studied. For the Matched Samples datasets, the population equating relationship was approximated by conducting traditional equipercentile equating with presmoothing and treating the resulting equivalents as target equivalents (Lee & Lee, 2013). Using the equipercentile method to approximate the population equating relationship has precedent in previous studies (Brossman & Lee, 2013; Lee & Lee, 2013), and appears to offer a theoretically sensible, yet non-ideal, solution. The equipercentile method is a preferable option because it does not speak to the dimensionality of the data as the (M)IRT methods do. Thus, there is no theoretical reason to suspect that results obtained from the equipercentile method will align more closely with any of the proposed (M)IRT methods, which reduces the likelihood that results will be biased. Lastly, the equipercentile method is similar to the IRT methods in that it specifies the equating relationship as curvilinear, unlike linear and mean equating methods.

Single Population Datasets

Since the use of the traditional equipercentile equating method as a criterion was not without limitations, another criterion was included to corroborate results. However, in order to have another criterion, a different sampling design was required, resulting in what are referred to as the Single Population datasets. These were created by randomly sampling examinees from the Old Form group only and assigning them to pseudo-Old and pseudo-New Form groups, each with a sample size of 3,000. For these datasets, equating is theoretically unnecessary, since the same form (i.e., the original Old Form) was administered to both pseudo-groups and those pseudo-groups come from the same population of examinees (i.e., Old Form examinees). Therefore, the Identity equating method was used as the criterion with these datasets, where a score on the New Form is equal to the same score on the Old Form. If the New Form score were subtracted from the Old Form equivalent to create a difference plot, the Identity line would be a straight line at a difference of zero.

Thus, the Identity relationship makes no adjustments to account for differences in form difficulties because none exist. Any adjustments observed in the estimated equating relationships represent random and/or systematic error and can be compared to the Identity relationship. This approach is not perfect, however, because some error is expected in the estimated equating relationships due to sampling. Therefore, slight deviations were expected when an equating method was applied to the Single Population datasets, since samples were used instead of populations. For these datasets, equating procedures are effectively adjusting for sampling error. This approach is also known as equating a test to itself (Kolen & Brennan, 2004) and has been used as a criterion in other equating studies when the true equating relationship was unknown (Petersen et al., 1982). However, this approach is also not without limitations and has been found to favor equating procedures with the fewest parameters (Brennan & Kolen, 1987a, 1987b). As a result, Kolen and Brennan (2004) recommend that this approach be used to detect poor-performing equating methods rather than to pinpoint the best-performing method.

Traditional equipercentile equating with presmoothing was included as a condition with the Single Population datasets in order to determine how much it deviated from the Identity relationship. If there was a large discrepancy, this could signal that the traditional equipercentile method may not be a good proxy for the population equating relationships for the Matched Samples datasets. If the traditional equipercentile method deviated more from the Identity criterion than the (M)IRT methods, this could suggest it is more sensitive to the random error invoked by using samples rather than populations.

An added benefit of including the Identity criterion is that the sampling method used to create the Matched Samples datasets could be evaluated. If using a matching variable to create randomly equivalent groups for the Matched Samples datasets was successful, then the standardized mean difference (or effect size) between pseudo-groups for those datasets should be similar to that obtained for the Single Population datasets.

This is because, for the Single Population datasets, random sampling was used without any matching variable and sampling was done from the same population. Therefore, the standardized mean difference between group ability levels for the Single Population datasets served as a baseline representing the differences expected between two groups that are randomly equivalent. While there were clear benefits of adding another equating criterion, doing so effectively doubled the number of conditions in this dissertation because the same procedures were carried out in both the Matched Samples and Single Population datasets.

Dimensionality Assessment

Dimensionality assessments can be helpful when trying to decide between competing equating methods that incorporate IRT. For example, results of dimensionality assessments can be used to evaluate whether the unidimensionality assumption associated with UIRT equating methods has been severely violated. When evidence suggests unidimensionality is violated, alternative equating methods, such as the traditional equipercentile or MIRT observed score methods, can be used instead. Furthermore, dimensionality assessments can also be used to make predictions about how results from different equating methods might compare to one another. For example, when evidence suggests that two test forms are unidimensional, it might be expected that equating results from the UIRT and equipercentile methods would be more similar than those between the MIRT and equipercentile methods.

Several techniques were used to thoroughly assess the dimensionality of the AP Exams with regard to item format and content specification, each considered separately. Analyses were conducted separately on the main and alternate forms using each procedure discussed next.

First, the disattenuated correlation between section scores (where a section was defined by either item format or content specification, but not both) was computed as

\rho_{T_1 T_2} = \frac{\rho_{S_1 S_2}}{\sqrt{(\rho_{S_1 S_1'})(\rho_{S_2 S_2'})}},     (3.2)

where \rho_{S_1 S_2} represents the observed correlation between scores on sections 1 and 2, \rho_{S_1 S_1'} is the coefficient alpha reliability of section 1 scores, and \rho_{S_2 S_2'} is the coefficient alpha reliability of section 2 scores. When considering dimensionality associated with item format, sections 1 and 2 correspond to total scores for MC and FR items, respectively. When looking at dimensionality due to content, sections 1 and 2 represent groups of items that measure distinct content subdomains, as specified in the operational test blueprint. Disattenuated correlations close to 1 indicate that the two sections measure the same construct. If all pairwise disattenuated correlations are near 1, this can be taken as support that unidimensionality holds for that test form. It should be noted that coefficient alpha assumes essential tau equivalence for each section. This may be too strict an assumption for mixed-format tests because there is likely some unique true score variance associated with each item format. Estimates of reliability that assume sections are congeneric or classically congeneric, such as the Feldt (1975) or Feldt-Gilmer (1981) coefficients, respectively, are less restrictive and may be more suitable for mixed-format tests. However, the alpha, Feldt-Gilmer, and Feldt coefficients were expected to produce similar estimates of reliability, and as a result, coefficient alpha was used.
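A hedged sketch of the disattenuated correlation in Equation 3.2, with coefficient alpha as the reliability estimate, is given below in Python. The examinees-by-items score matrices and function names are hypothetical.

    import numpy as np

    def coefficient_alpha(items):
        """Coefficient alpha for an examinees-by-items score matrix."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_var_sum = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1.0)) * (1.0 - item_var_sum / total_var)

    def disattenuated_corr(section1, section2):
        """Disattenuated correlation between two sections (Equation 3.2)."""
        s1, s2 = section1.sum(axis=1), section2.sum(axis=1)
        r_obs = np.corrcoef(s1, s2)[0, 1]
        return r_obs / np.sqrt(coefficient_alpha(section1) * coefficient_alpha(section2))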

Principal component analyses (PCA) on tetrachoric and polychoric correlations were also conducted to evaluate the dimensionality of the main and alternate form datasets. Correlation matrices were computed in SAS (SAS Institute, Version 9.3), while the PCAs were performed using Mplus software (Muthén & Muthén). Scree plots of eigenvalues were used to determine the amount of aggregate variance accounted for by each PC or latent dimension (Cattell, 1966). Next, a series of non-linear exploratory factor analyses (EFAs) were conducted to compare solutions with various numbers of factors retained. The main difference between PCA and EFA is that the former uses aggregate variance whereas the latter uses common variance. The test specification table was used to determine the maximum number of factors to include in the highest-order solution. For example, Chemistry had four main content subdomains, and so solutions with 1, 2, 3, and 4 factors were analyzed. Both orthogonal and oblique rotations were performed to be consistent with the latent trait distributions specified in some of the equating procedures. Two sets of confirmatory factor analyses were run on each form, where dimensions represented either content subdomain or item format, as outlined by the test specification table. For all analyses, the Akaike Information Criterion (AIC; Akaike, 1974) and Bayesian Information Criterion (BIC; Schwarz, 1978) were used to assess model fit. The AIC can be computed as

AIC = -2\ln L + 2N_{parm},     (3.3)

where \ln L is the log likelihood and N_{parm} corresponds to the number of estimated model parameters. The BIC can be computed as

BIC = -2\ln L + [\ln(N)]N_{parm},     (3.4)

where N is the number of examinees. The AIC and BIC can be used to compare the fit of non-nested models, and for both indices, smaller values indicate better fit. The AIC does not take sample size into consideration and tends to favor more complex models, whereas the BIC applies a stronger penalty as the complexity of the model increases.
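Equations 3.3 and 3.4 translate directly into code. The small Python sketch below is illustrative only; the log likelihoods, parameter counts, and sample size are hypothetical numbers, not values from the AP calibrations.

    import numpy as np

    def aic(log_lik, n_parm):
        """Akaike Information Criterion (Equation 3.3)."""
        return -2.0 * log_lik + 2.0 * n_parm

    def bic(log_lik, n_parm, n_examinees):
        """Bayesian Information Criterion (Equation 3.4)."""
        return -2.0 * log_lik + np.log(n_examinees) * n_parm

    # Hypothetical comparison of a UIRT and a two-dimensional MIRT solution (N = 3,000).
    print(aic(-15230.4, 162), aic(-15192.7, 243))
    print(bic(-15230.4, 162, 3000), bic(-15192.7, 243, 3000))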

Lastly, the software program Poly-DIMTEST (Li & Stout, 1995) was used to assess whether the datasets were essentially unidimensional. The test employed by the Poly-DIMTEST program is based on estimating covariances between pairs of items, conditional on a specific subscore. Poly-DIMTEST operates under the same theoretical framework as DIMTEST, but can handle MC and FR item types. The Poly-DIMTEST test statistic, adjusted for bias, was used to test the null hypothesis of essential unidimensionality. In an ideal situation, the results from Poly-DIMTEST would agree with those from the PCA and EFA, thereby making the interpretation of the dimensionality assessments more credible. Regardless, results from all approaches were used jointly to thoroughly assess the dimensionality of the AP Exams.

Item Calibration and Model-Data Fit

MC and FR items on each form were calibrated using flexMIRT (Cai, 2012). Unidimensional IRT, Bifactor, and full MIRT versions of the 3PL + GR model combination were calibrated for all AP forms. The Bifactor model was estimated once with group-specific factors defined by item format and once with group-specific factors defined by content subdomain. The full MIRT models were calibrated using two- and four-dimensional solutions with correlations between factors fixed at zero or freely estimated. A summary of the studied conditions can be found in Table A7. It should be noted that calibrations were repeated twice, first for the Single Population and then for the Matched Samples datasets; therefore, the number of conditions is twice that reported in Table A7. Furthermore, each calibration was conducted using two different item prior settings, which also doubled the number of conditions. Details concerning the item prior settings will be presented later in this chapter. For each condition, MC and FR item parameters were estimated simultaneously.

For the unidimensional calibrations, scale indeterminacy was handled by specifying latent ability distributions as univariate standard normal (i.e., θ ~ UVN(0, 1)). For multidimensional calibrations, latent ability distributions for the New and Old Forms were specified in two different ways. First, the latent ability distributions were specified as multivariate normal with means of 0 and standard deviations of 1 for each of the m latent abilities (i.e., θ ~ MVN(0, I)). In this setting, the correlations between latent abilities were assumed to be zero. Using the second method, distributions were specified to follow a multivariate normal distribution, but the correlations between dimensions were freely estimated during item calibration. Since forms were calibrated separately, it was possible for the estimated correlations to vary across forms. If the two sets of correlations were found to be very different, this could suggest that groups were not made to be randomly equivalent, thereby violating a basic assumption of the RG design. Performing a rotation of the axes would not provide an exact solution to the correlational differences because, while the correlation would change, the value to which it changed could not be controlled.

Furthermore, rotation changes the means, standard deviations, and correlations between the latent dimensions, but not the coefficients in the IRF (Reckase, 2009), thereby leaving the marginal observed score distributions unchanged. If correlational differences were found, the estimated correlations could be left unchanged (and remain slightly different), or averages could be taken. When latent traits are assumed orthogonal, only one quadrature distribution is required for equating, because all correlations are fixed at zero. In this dissertation, two separate quadrature distributions (corresponding to the New and Old Forms) were used to estimate the marginal observed score distributions, which were later used in the equating procedures. Therefore, the correlations between latent traits were allowed to differ between the New and Old Form groups, and no transformations were performed.

For the majority of the methods, 25 evenly spaced quadrature points were used during estimation; however, it was necessary to reduce the number of quadrature points to 11 for the BF-content method. This reduction was made only for the estimation of the observed total score distributions, which was conducted after item parameter estimation had been completed. Across all models, 25 quadrature points were used during the item parameter estimation phase. Complex structure, as opposed to simple structure, was specified for all full MIRT conditions such that items were free to load on all dimensions. For the Bifactor model, items were free to load on the general dimension and one additional group-specific dimension, resulting in two non-zero discrimination parameter estimates per item.

To avoid convergence problems, prior distributions were used to estimate the slope and pseudo-guessing parameters. The use of priors, depending on how restrictive or informative they are, can influence parameter estimation results and potentially affect equating results. Therefore, two different prior settings were investigated. In the first set of priors, referred to as Set I and labeled as (I) in figures, the prior distributions for the slope and pseudo-guessing parameters were set as lognormal(0, 0.5) and beta(5, 17), respectively, in the UIRT, full MIRT, and Bifactor calibrations.

These prior settings are employed in BILOG-MG (Zimowski et al., 2003) and have been found to keep parameter estimates within plausible ranges for UIRT models. The second set of priors, referred to as Set II and labeled as (II) in figures, was used by DeMars (2013) and was found to result in successful convergence and non-negative discrimination estimates with the Bifactor model. In Set II, the c-parameter prior was beta(11, 91), the prior for the Bifactor a_0 parameters was normal(1.3, 0.8), and the prior for the UIRT, full MIRT, and Bifactor a_s slope parameters was lognormal(-0.4, 0.3). Calibration settings can be found in Table A3 and were selected to be as similar as possible across UIRT and MIRT models in order to reduce random noise.

Several fit indices, in addition to results from the dimensionality assessments, were used to evaluate the goodness-of-fit of the studied (M)IRT models to the AP data. The eigenvalues produced from the PCAs were first examined to judge whether the unidimensionality assumption of the UIRT models was supported. Specifically, Lord's (1980) method of comparing the ratio of the first eigenvalue to the second was used in this study. If the first eigenvalue was large relative to the second, and the second was not relatively large in comparison to the third, this was taken as support that unidimensionality was not severely violated. A non-significant test statistic provided by Poly-DIMTEST was also taken as support that essential unidimensionality held. flexMIRT provides several fit indices for dichotomous and polytomous item types, many of which are chi-square-based item-level indices and likelihood-based overall model fit indices. Several of the available indices were used in this dissertation to evaluate overall and item-level model fit. More specifically, inspection of model fit was conducted by examining the AIC and BIC.

Equating Procedures

As mentioned previously, Identity equating and traditional equipercentile equating with presmoothing served as proxies for the population equating relationships for the Single Population and Matched Samples datasets, respectively. Specifically, log-linear presmoothing was conducted in a manner such that random error was reduced, yet the amount of systematic error was not greatly increased. The choice between using pre- and post-smoothing methods was influenced by the fact that presmoothing is currently used operationally with other College Board exams. In order to determine the optimal degree of smoothing, plots of raw-to-raw equivalents were examined, as well as moments for raw scores and unrounded scale scores. Equipercentile procedures with presmoothing were carried out in RAGE-RGEQUATE (Kolen, 2005). The polynomial degree of smoothing will be provided in the Results section.

UIRT Observed Score Equating

In order to perform UIRT observed score equating, number-correct score distributions need to be estimated for the Old and New Form examinees using a combination of UIRT models; these distributions are then equated using traditional equipercentile methods. A recursive algorithm is used to compute the conditional observed score distributions, which are then aggregated to form the marginal distribution. The recursion formula used to estimate score distributions differs depending on whether a dichotomous or polytomous model is specified. For dichotomous items, conditional number-correct score distributions were found using the Lord and Wingersky (1984) recursion formula presented in Chapter 2. In order to find the conditional number-correct score distributions for polytomous items, a modified recursion formula developed by Hanson (1994) and Thissen et al. (1995) is required. Since polytomous items contain more than two response categories, a compound multinomial distribution is used rather than a compound binomial distribution.

For a particular item i with K response categories, the probability of earning a score in the kth category is

f_r(U_i = W_{ik}|\theta_j) = P_{ik}(\theta_j),     (3.5)

where r indexes the item being administered, and W_{ik} is the scoring function of the kth category for item i. It should be noted that K can vary for each polytomous item. When r > 1, the probability of a score, x, after administration of the rth item is

f_r(x|\theta_j) = \sum_{k=1}^{K} f_{r-1}(x - W_{ik}|\theta_j) P_{ik}(\theta_j),   for \min_r \le x \le \max_r.     (3.6)

Here, min_r and max_r represent the minimum and maximum possible scores after addition of the rth item. Once the conditional observed score distributions are found, the marginal observed score distributions are found by applying Equation 2.43 or 2.44, and traditional equipercentile equating is conducted on the marginal distributions. With mixed-format tests, the conditional distributions found after applying the separate recursive algorithms must be combined prior to computing the marginal distribution. The marginal observed score distributions are then found by summing over a univariate ability distribution.

The unidimensional IRT observed score equating procedures were conducted in RAGE-RGEQUATE, using the estimated marginal total score distributions for the New and Old Form groups from flexMIRT. More specifically, the marginal observed score distributions were inputted as frequencies in the RAGE-RGEQUATE program, along with the observed raw composite scores and the raw-to-scale conversion table for the Old Form. Traditional equipercentile equating was carried out in the usual manner without smoothing.
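As an illustration of Equations 3.5 and 3.6, and of the combination across item formats described above, the Python sketch below builds a conditional total-score distribution for a mixed-format form at a single θ. It assumes that the lowest category score of every item is zero and that section scores are combined with unit weights; both are simplifying assumptions made for the illustration, not features of the operational procedures.

    import numpy as np

    def polytomous_recursion(cat_probs, scoring_fns):
        """Conditional score distribution for polytomous items (Equations 3.5-3.6).
        cat_probs[i][k] = P_ik(theta_j); scoring_fns[i][k] = W_ik (integers, minimum 0)."""
        max_score = sum(int(w.max()) for w in scoring_fns)
        f = np.zeros(max_score + 1)
        f[0] = 1.0                         # before any item, a score of 0 is certain
        upto = 0                           # current maximum possible score
        for p_k, w_k in zip(cat_probs, scoring_fns):
            new = np.zeros(max_score + 1)
            for p, w in zip(p_k, w_k):     # add item r, category by category
                new[w:w + upto + 1] += f[:upto + 1] * p
            f = new
            upto += int(w_k.max())
        return f

    def combine_formats(f_mc, f_fr):
        """Conditional total-score distribution at a fixed theta (unit section weights)."""
        return np.convolve(f_mc, f_fr)

Because a dichotomous item is simply a two-category item with scores 0 and 1, the same routine reproduces the Lord and Wingersky (1984) recursion when given the category probabilities (1 − P_jr, P_jr).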

Bifactor Observed Score Equating

The steps involved in Bifactor observed score equating are very similar to those performed in UIRT observed score equating. For example, conditional number-correct scores were estimated and then aggregated to form marginal score distributions, which were then equated using traditional equipercentile methods. However, instead of having number-correct scores conditioned on a single θ, they were conditioned on θ_G (general ability) and group-specific abilities θ_s. When considering dimensionality due to item format, two group-specific dimensions exist that correspond to the MC and FR item formats. When considering dimensionality associated with content subdomain, the number of group-specific dimensions equals the number of content subdomains specified in the test specification table. Using Equation 2.47a or 2.47b, marginal observed score distributions were found by summing over a mutually orthogonal trivariate ability distribution when considering item format, or over a multivariate distribution of order m + 1 when considering content subdomains. Once the marginal distributions were computed for the main and alternate form subsamples, traditional equipercentile equating was conducted to estimate Old Form equivalents. The Bifactor observed-score equating procedures were conducted by using the estimated marginal total score distributions from flexMIRT as input to the RAGE-RGEQUATE equating program.

Full MIRT Observed Score Equating

Similar to the unidimensional case, performing observed score equating on mixed-format tests under an MIRT framework requires the use of different recursive formulas for MC and FR items. A full MIRT observed score equating procedure for dichotomous items was originally presented by Brossman and Lee (2013) and was applied to MC items; it will be discussed in greater detail below. However, the methodology required to include polytomous items in MIRT observed score equating is presented for the first time in this dissertation. To be consistent with the item calibration settings, the latent ability distributions were specified as (a) multivariate normal (i.e., θ ~ MVN(0, I)) with fixed zero correlations, or (b) multivariate normal with correlations equal to the estimated values found during item calibration. The translation and dilation indeterminacies were handled in flexMIRT by fixing each coordinate axis to have a mean of 0 and a standard deviation of 1. As mentioned earlier, under the RG design, a rotation of the axes has been found to have no impact on equating results for orthogonal solutions that specify the ability density as multivariate normal with an identity variance-covariance matrix (Brossman & Lee, 2013). Furthermore, it has been shown that rotation has no influence on response probabilities for non-orthogonal solutions as well (Reckase, 2009).

As a result, no transformations were performed to handle rotational indeterminacies under the full MIRT conditions. Estimated correlations for the New and Old Form groups were used to create two separate quadrature distributions (one for each group), which in turn were used to estimate marginal observed score distributions for each group. The estimated marginal observed score distributions were then used to equate the New Form scores to the Old Form scale using traditional equipercentile methods.

The conditional observed score distribution on the MC section is found for each combination of the m abilities and is expressed as f_r(x|θ), where θ represents a vector of m abilities and the subscript r is the item index. The probability of scoring a one on the first item is f_1(x = 1|θ_j) = P_1, whereas the probability of scoring a zero is f_1(x = 0|θ_j) = (1 − P_1). For a test with r > 1, the conditional number-correct distributions are computed using a direct extension of the Lord-Wingersky algorithm (Brossman & Lee, 2013) as

f_r(x|\theta_j) = f_{r-1}(x|\theta_j)(1 - P_r),                               x = 0,
               = f_{r-1}(x|\theta_j)(1 - P_r) + f_{r-1}(x - 1|\theta_j)P_r,    0 < x < r,     (3.7)
               = f_{r-1}(x - 1|\theta_j)P_r,                                   x = r.

Equation 3.7 differs from the UIRT version (Equation 2.41) in that the ability vector θ_j has replaced the scalar θ_j. The recursion formula continues for r − 1 iterations for a test of length r. To obtain the marginal distributions for each form, the conditional distributions are multiplied by the multivariate ability density ψ(θ) and integrated over all combinations of the m latent abilities as

f(x) = \int_{\theta_1} \int_{\theta_2} \cdots \int_{\theta_m} f(x|\theta)\psi(\theta) \, d\theta.     (3.8)

If a discrete multivariate ability distribution is assumed, integrals can be replaced with summation, so that Equation 3.8 can be replaced with Equation 3.9 and the marginal distributions are obtained as

f(x) = \sum_{\theta_1} \sum_{\theta_2} \cdots \sum_{\theta_m} f(x|\theta)\psi(\theta).     (3.9)

For polytomous items, the recursion formulas presented by Hanson (1994) and Thissen et al. (1995) (Equations 3.5 and 3.6) can be extended to the multidimensional framework to estimate the number-correct distributions, conditional on the m latent abilities. For a particular item i with K response categories, the probability of examinee j earning a score in the kth category is conditional on a vector of m abilities (θ_j) and written as

f_r(U_i = W_{rk}|\theta_j) = P_{ik}(\theta_j),     (3.10)

where W_{rk} is the scoring function of the kth category for the rth item. It should be noted that K can vary across the collection of polytomous items. When r > 1, the probability of a score of x after administration of the rth item is

f_r(x|\theta_j) = \sum_{k=1}^{K} f_{r-1}(x - W_{rk}|\theta_j) P_{ik}(\theta_j),   for \min_r \le x \le \max_r.     (3.11)

Again, min_r and max_r represent the minimum and maximum scores possible after addition of the rth item. Whenever x − W_{rk} is less than the minimum score, or greater than the maximum score, then by definition f_{r-1}(x − W_{rk}|\theta_j) is zero. The marginal observed score distributions are found by multiplying the conditional distributions by the multivariate ability density ψ(θ) and integrating or summing over all combinations of the m latent abilities as

f(x) = \int_{\theta_1} \int_{\theta_2} \cdots \int_{\theta_m} f(x|\theta)\psi(\theta) \, d\theta     (3.12)

or

f(x) = \sum_{\theta_1} \sum_{\theta_2} \cdots \sum_{\theta_m} f(x|\theta)\psi(\theta).     (3.13)
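A hedged Python sketch of the discrete marginalization in Equations 3.9 and 3.13 is shown below. It assumes that a caller supplies a function returning the conditional total-score distribution f(x|θ) for a length-m ability vector (for example, by chaining the recursions in Equations 3.7 and 3.11 and combining the MC and FR sections), and it uses an arbitrary grid of 15 equally spaced quadrature points per dimension; these choices are for illustration only.

    import numpy as np
    from itertools import product

    def mvn_density(theta, cov):
        """Multivariate normal density with mean zero at a single quadrature point."""
        m = len(theta)
        inv, det = np.linalg.inv(cov), np.linalg.det(cov)
        return np.exp(-0.5 * theta @ inv @ theta) / np.sqrt((2 * np.pi) ** m * det)

    def marginal_distribution(cond_dist, cov, points=np.linspace(-4, 4, 15)):
        """Marginal observed score distribution (Equations 3.9 and 3.13).
        cond_dist(theta) must return f(x | theta) for an m-dimensional ability vector."""
        m = cov.shape[0]
        grid = [np.array(t) for t in product(points, repeat=m)]
        weights = np.array([mvn_density(t, cov) for t in grid])
        weights /= weights.sum()             # normalize psi(theta) over the discrete grid
        f_x = None
        for t, w in zip(grid, weights):
            fx = cond_dist(t)
            f_x = w * fx if f_x is None else f_x + w * fx
        return f_x

Because the covariance matrix is an argument, the same sketch covers both the orthogonal case (an identity matrix) and the case in which the correlations estimated during calibration are retained.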

Since the AP Exams are mixed-format, the conditional observed score distributions found by use of the dichotomous and polytomous recursive formulas need to be combined, conditional on θ_j, prior to estimating the marginal distributions. Full MIRT (as well as UIRT and Bifactor) equating was implemented by using the estimated marginal total score distributions produced by flexMIRT. The marginal observed score distributions were inputted as frequencies in the RAGE-RGEQUATE program, along with the observed raw scores and the raw-to-scale conversion table for the Old Form. Traditional equipercentile equating was carried out in the usual manner and without smoothing.

Evaluation Criteria

Several criteria were used to evaluate the performance of the observed score equating procedures investigated in this dissertation. To recap, the performance of the full MIRT observed score procedure in comparison to the UIRT and Bifactor procedures was of primary interest. Additionally, traditional equipercentile equating with presmoothing was performed and treated as a baseline comparison for the (M)IRT methods because it has been shown to be reasonably invariant to group differences (Kolen, 1990) and does not make explicit assumptions about the underlying dimensionality of the data.

Root Mean Squared Difference

The aggregated differences between the (M)IRT procedures and the criterion procedure were evaluated based on a weighted Root Mean Squared Difference (wRMSD) statistic. The wRMSD was used to evaluate the discrepancies between the criterion equivalents and the (M)IRT equivalents and is expressed as

wRMSD = \left\{ \sum_{j=0}^{Z} w_j [eq_Y^{IRT}(x_j) - eq_Y^{C}(x_j)]^2 \right\}^{1/2}.     (3.14)

Here, eq_Y^{IRT}(x_j) represents the equated equivalent for raw score j using one of the (M)IRT procedures, eq_Y^{C}(x_j) represents the criterion equivalent of raw score j, w_j is the empirical weight associated with that raw score j on the New Form, and Z indexes the maximum raw composite score.
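Equation 3.14 can be computed directly from the two sets of equivalents and the New Form score frequencies, as in the hedged Python sketch below; the function name and the normalization of the frequencies into weights are assumptions of the illustration.

    import numpy as np

    def wrmsd(eq_irt, eq_crit, new_form_freqs):
        """Weighted root mean squared difference (Equation 3.14)."""
        w = np.asarray(new_form_freqs, dtype=float)
        w = w / w.sum()                    # weights proportional to New Form frequencies
        diff = np.asarray(eq_irt, dtype=float) - np.asarray(eq_crit, dtype=float)
        return np.sqrt(np.sum(w * diff ** 2))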

Differences That Matter

The term Difference That Matters (DTM) was first introduced by Dorans, Holland, Thayer, and Tateneni (2003) as a way to judge how large a difference needs to be before it can be deemed practically important with regard to scale linking. The DTM can easily be generalized to the evaluation of equating results, which is done in this dissertation. The DTM offers an attractive, albeit not perfect, solution to the situation where one wants to compare several equating methods but does not have access to the ideal criterion: the population equating relationship. This is essentially the situation in the current dissertation, in which the best approximation to the population equating relationship has been specified as the equipercentile or Identity relationship, for reasons previously discussed. If the dimensionality assessments provide evidence that unidimensionality does not hold, then the MIRT procedures would be expected to perform more similarly to the criterion method than the UIRT procedures, because the MIRT procedures do not make this statistical assumption. However, the extent to which the UIRT and MIRT methods differ from the criterion method will depend on the dimensional structure of the AP data.

According to Dorans et al. (2003), a DTM is marked by a difference of half a reported score unit, which implies it can be applied to both raw and scale scores. In the current dissertation, differences and absolute differences between the (M)IRT procedures and the criterion procedures at each alternate form score were evaluated using the 0.5 criterion and graphically displayed. This was performed for raw scores and unrounded scale scores. In addition, the weighted average differences for each (M)IRT equating method were evaluated, with weights proportional to the empirical frequencies of the New Form group.

AP Grade Agreements

Another way to think about differences that matter is to ask the question: what matters most to the examinee? For AP examinees, the answer would most certainly be their rounded AP Grades. Since AP Grades are ultimately what colleges and universities use to assign course credit, it was important to emphasize AP Grade agreements between the studied equating methods and the criterion. Therefore, agreement statistics were computed to determine whether the differences observed as a result of using different (M)IRT equating methods carried practical implications for examinees. Exact percent agreements (EPAs) were computed for individual AP Grades (i.e., 1, 2, 3, 4, and 5) and then summed to obtain the overall EPAs. More specifically, the formula used to compute the overall EPA was

Overall\ EPA = \frac{s_{11} + s_{22} + s_{33} + s_{44} + s_{55}}{S},     (3.15)

where s_{aa} is the number of New Form raw score points that converted to a grade of a (where a can be 1, 2, 3, 4, or 5) using both the studied method and the criterion method, and S is the total number of raw score points on the New Form. Results from the traditional equipercentile and Identity equating methods served as the criteria.
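For completeness, a minimal Python sketch of Equation 3.15 is given below; it assumes that the AP Grade implied by each New Form raw score point has already been obtained from the studied and criterion conversions, and the function name is illustrative.

    import numpy as np

    def overall_epa(grades_studied, grades_criterion):
        """Overall exact percent agreement (Equation 3.15), returned as a proportion.
        Each argument holds the AP Grade (1-5) assigned to every New Form raw score
        point by the studied method and the criterion method, respectively."""
        g1 = np.asarray(grades_studied)
        g2 = np.asarray(grades_criterion)
        return np.mean(g1 == g2)

Multiplying the returned proportion by 100 expresses the agreement as a percentage.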

Summary

In Chapter Three, information was provided concerning the methodologies used to address the research objectives outlined in Chapter One. Procedures related to dimensionality assessments, item calibration, and observed score equating methods were described at length. A large portion was also devoted to describing the data manipulations required in order to approximate the RG design. This approximation was done using two different approaches, which resulted in two different sets of data: the Single Population and Matched Samples datasets. The Identity and traditional equipercentile methods were selected as the equating criteria for the Single Population and Matched Samples datasets, respectively. Results from the dimensionality assessments and equating procedures will be presented in Chapter Four.

CHAPTER IV
RESULTS

In Chapter Four, equating results for the UIRT, Bifactor, full MIRT, and traditional equipercentile methods will be presented for Chemistry, Spanish Language, and English Language and Composition. Descriptive statistics for test forms, sample characteristics, and dimensionality assessments are provided prior to presenting the equating results. For each subject, two equating criteria were used in an attempt to cross-validate results, since true equating relationships are not known. The theoretical Identity relationship served as the first criterion and was used with the Single Population datasets. The traditional equipercentile method with log-linear presmoothing served as the second criterion and was used with the Matched Samples datasets. Equating results were evaluated with respect to raw and scale score equating relationships, DTM standards for raw and scale scores, weighted Root Mean Squared Differences between the equating criteria and the studied methods for raw and scale scores, and AP Grade agreement percentages.

Sample and Test Characteristics

The data used in the current study were originally collected under the CINEG design and transformed in a manner so that equating under the RG design could be studied. Three subject exams were selected from the 2011 administration of the Advanced Placement (AP) Exams: Chemistry, Spanish Language, and English Language and Composition (referred to as English from this point forward). For each subject, two sets of artificial RG data were created, using different sampling methods, in an effort to cross-validate results. The first set is referred to as the Matched Samples data and was created by sampling examinees using ethnicity as a matching variable. Samples were drawn from the original New and Old Forms to create two pseudo-groups that were equivalent with respect to ethnicity composition. Sampling continued without replacement until the desired sample size (N = 3,000) was achieved and the effect size for group differences, based on common item scores, was within ±0.05.

The second set of data is referred to as the Single Population dataset and was created by randomly sampling two pseudo-groups of examinees from the original Old Form population. Three thousand examinees were randomly sampled and assigned to pseudo-Old and pseudo-New Forms, thus resulting in sample sizes equal to those used with the Matched Samples data. Both sets of artificial data were treated as though they were collected under the RG design.

An assumption of the RG design is that the Old and New Form examinees be similar in ability, which was investigated through the computation of an effect size, or standardized mean difference, for common item scores. The effect sizes of the common item scores for the Matched Samples and Single Population data can be seen in Table A8. Even though the effect sizes were within the target of ±0.05, it was still possible that the distributional shapes of the pseudo-groups differed. Therefore, histograms were used to visually inspect the similarities between the pseudo-groups; they can be found in Figures A1 to A6. For the Matched Samples design, the Spanish pseudo-groups appeared similar, but for Chemistry and English, the distributional shapes appeared somewhat different. For the Single Population design, the distributional shapes of the pseudo-groups appeared very similar across all subject tests.

Form Characteristics. AP Chemistry, Spanish Language, and English exams were included in the current study so that results could be compared across a variety of subject areas. Each exam is mixed-format and made up of both multiple-choice (MC) and free-response (FR) items. Specific characteristics of the exam forms can be found in Tables A9 and A10 for the Matched Samples and Single Population datasets, respectively. Based on the frequency histograms in Figures A8 to A13 and the descriptive statistics in Tables A9 and A10, it is apparent that there were far fewer examinees at the upper and lower ends of the composite score scale for each subject. Furthermore, the total score variance tended to be larger for the New Forms in the Matched Samples datasets.

score variance tended to be larger for the New Forms in the Matched Samples datasets. In order to avoid interpreting equating results that were based on few examinees, a general rule was applied such that results were tabulated over score ranges that had a minimum of 10 examinees. This helped to decrease the amount of random error in the comparison of estimated equating relationships, thereby increasing the fidelity of those comparisons. This resulted in truncated score ranges in comparison to those listed in Tables A9 and A10, but guarded against making unwarranted claims. Since this study is designed to inform research and not for operational use, the decision to truncate seemed reasonable. For the Single Population datasets, the truncated raw score scales for Chemistry, Spanish, and English were , , and , respectively. For the Matched Samples datasets, the raw score scales after truncation were , , and for Chemistry, Spanish, and English, respectively.

Test Blueprints. Test specification tables were used to inform confirmatory-based dimensionality assessment procedures and to later assign items to group-specific factors under the Bifactor MIRT model. Items were classified according to format (i.e., MC or FR) and content subdomain, and each classification level was considered separately. The test specification tables for Chemistry, Spanish Language, and English can be found in Tables A11 to A13. English was the only exam that had two content subdomains, whereas the other subjects had four. This structure is reflected in the Bifactor and full MIRT models such that the maximum number of dimensions is three and two, respectively, for the English New and Old Forms. Spanish Language is unique because content subdomain is nested within item format such that Listening (Content 1) and Reading (Content 2) are in the MC format and Writing (Content 3) and Speaking (Content 4) are in the FR format. Across all subject areas, the New and Old Form exams were built to the same test specification table.
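As a concrete illustration of the score-range truncation rule described above (tabulating results only at raw score points with at least 10 examinees), a minimal sketch is given below. This is not code from the study; the function name, the treatment of the retained range as the span between the lowest and highest qualifying score points, and the simulated data are assumptions made purely for illustration.

```python
import numpy as np

def reportable_range(raw_scores, min_count=10):
    """Return the lowest and highest raw score points that have at least
    `min_count` examinees; results are tabulated only over this range."""
    scores, counts = np.unique(np.asarray(raw_scores), return_counts=True)
    kept = scores[counts >= min_count]
    if kept.size == 0:
        raise ValueError("No score point reaches the required frequency.")
    return int(kept.min()), int(kept.max())

# Hypothetical composite scores for one pseudo-group on a 0-156 scale.
rng = np.random.default_rng(0)
simulated_scores = rng.binomial(n=156, p=0.45, size=3000)
print(reportable_range(simulated_scores, min_count=10))
```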

Dimensionality Assessments

The Matched Samples data were used to assess form dimensionality for each subject area. First, principal components analysis (PCA) was performed to test the UIRT assumption of unidimensionality. Scree plots containing the eigenvalues can be seen in Figures A7 to A9 for Chemistry, Spanish, and English, respectively. In general, Chemistry showed evidence of unidimensionality, whereas the scree plots for Spanish and English both suggested that a two dimensional solution may provide better fit. According to Reckase (1979), the unidimensionality assumption is supported when the first eigenvalue accounts for 20% or more of the total variance, which was found only for Chemistry.

Poly-DIMTEST was also used to test the assumption of essential unidimensionality. In Poly-DIMTEST, three different approaches were used. First, a confirmatory approach using item format was run, followed by a confirmatory approach using content subdomain, and lastly, an exploratory approach was run. In order to select the AT1 items for the exploratory approach, an exploratory factor analysis (EFA) was conducted in Mplus. The ten items with the largest loadings on only the first factor were chosen as AT1 items. For the confirmatory approach using content, each of the four content subdomains was treated as AT1, which resulted in four separate runs. The probabilities (i.e., p-values) associated with the Poly-DIMTEST test statistic can be found in Table A14. Probabilities less than 0.05 are interpreted as statistically significant and indicate that essential unidimensionality likely does not hold. The Poly-DIMTEST test statistics indicated that essential unidimensionality held for the most part across all exams. Exceptions were seen in a few instances with the Old Form for Chemistry and English and the New Form for Spanish Language. Overall, the p-values associated with the Poly-DIMTEST test statistic varied widely depending on which subset of items defined AT1, suggesting that results may be unstable.
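The 20% guideline attributed to Reckase (1979) above amounts to checking the proportion of total variance captured by the first eigenvalue of the item correlation matrix. A minimal sketch of that check follows; it assumes a polychoric (or Pearson) correlation matrix has already been computed, and the function name and the small three-item matrix are hypothetical, used only to show the calculation.

```python
import numpy as np

def first_eigenvalue_proportion(corr_matrix):
    """Proportion of total variance accounted for by the first principal
    component of an item correlation matrix (the total variance equals the
    number of items, i.e., the trace of the matrix)."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(corr_matrix))[::-1]  # descending order
    return eigenvalues[0] / eigenvalues.sum()

# Hypothetical three-item correlation matrix.
R = np.array([[1.00, 0.45, 0.40],
              [0.45, 1.00, 0.50],
              [0.40, 0.50, 1.00]])
proportion = first_eigenvalue_proportion(R)
print(f"First component accounts for {proportion:.1%} of the total variance")
print("Meets the 20% guideline?", proportion >= 0.20)
```

In the current study, this proportion reached the 20% guideline only for Chemistry.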

Disattenuated correlations were also used to investigate the dimensional structure of the test forms, and values can be found in Table A15. The disattenuated correlations for Chemistry were very close to 1.0 and suggest that the MC and FR item formats are likely measuring very similar constructs, thereby supporting the UIRT assumption of unidimensionality. However, the different item formats for Spanish and English are more likely to measure somewhat different constructs, as the disattenuated correlations are noticeably smaller. For English, it appears as though item format contributes to the multidimensional structure of the exam slightly more than content subdomain, as the correlations are slightly lower. The disattenuated correlations between content subdomains for Chemistry were generally close to 1.0, which also indicates they measure very similar constructs. Smaller disattenuated correlations were associated with content for Spanish and tended to be smaller for the Old Form. For Spanish, the smallest disattenuated correlation was found between Reading (C2) and Speaking (C4). Across subject exams, the trend observed with the disattenuated correlations agreed with the information provided by the scree plots, such that English and Spanish appeared to have the greatest degree of multidimensionality, relatively speaking, whereas Chemistry showed more evidence of being unidimensional.

Model Fit

Using the Matched Samples datasets, a series of categorical exploratory and confirmatory factor analyses were conducted to evaluate the structure of the data and the appropriateness of the (M)IRT models used in this dissertation. Across subject exams and for both New and Old Forms, a series of 1-, 2-, 3-, and 4-factor EFA models were fitted and AIC and BIC fit indices were computed for each, using Mplus. Smaller AIC and BIC values indicate better model fit. Furthermore, using item format and content subdomain specifications based on the test blueprints, two CFA models (fit according to item format and content subdomain, considered separately) were fit to each form. The AIC and BIC indices for these analyses can be found in Tables A16 and A17,

respectively. For both forms of Chemistry, Spanish, and English, the AIC behaved as expected such that it favored the higher dimensional EFA solutions. However, the BIC favored the two dimensional EFA solutions for Chemistry and English and the four dimensional solution for Spanish. The AIC and BIC favored the format-based CFA model for Chemistry and Spanish, but favored the content-based CFA for English (Tables A16 and A17). Considering the EFA and CFA models together, and based on the AIC and BIC values, the highest dimensional EFA solution was favored overall for both forms of Spanish and English. For the Chemistry New Form, the BIC favored the two dimensional EFA model and the AIC favored the format-based CFA, whereas for the Old Form, the four dimensional EFA model was favored by the AIC and BIC. Limitations of the AIC and BIC indices are that they depend on sample size, and there currently are no general guidelines as to what constitutes a large value or how large a difference between two indices needs to be in order to be considered significant.

Fitted Distributions

Traditional Equipercentile Method. For the Matched Samples datasets, traditional equipercentile equating with log-linear presmoothing was conducted in order to have a criterion with which the (M)IRT equating relationships could be compared. For the Single Population samples, the Identity relationship served as the criterion, but equipercentile equating with presmoothing was included as a way of determining its performance as a criterion for the Matched Samples datasets. If the equipercentile equating relationship was very different from the Identity relationship for the Single Population data, then this could suggest that it is not a good criterion for the Matched Samples data. Presmoothing values were selected based on a series of chi-square difference tests and inspection of raw and fitted distributions.
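Log-linear presmoothing of the kind described above can be sketched as a Poisson log-linear model in which the log of each score frequency is regressed on polynomial powers of the score; a degree-C fit preserves the first C moments of the observed distribution, and the chi-square difference tests compare fits of successive degrees. The sketch below is a generic illustration under those assumptions, not the operational software used in this study, and all names and data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

def loglinear_presmooth(frequencies, degree):
    """Fit a polynomial log-linear (Poisson) model to a raw score frequency
    distribution and return the fitted (smoothed) frequencies."""
    y = np.asarray(frequencies, dtype=float)
    scores = np.arange(len(y), dtype=float)
    z = (scores - scores.mean()) / scores.std()  # standardized for numerical stability
    X = sm.add_constant(np.column_stack([z ** p for p in range(1, degree + 1)]))
    fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    return fit.fittedvalues

# Hypothetical frequencies over a short 0-10 raw score scale.
observed = np.array([3, 8, 20, 41, 66, 80, 71, 49, 28, 12, 4])
print(np.round(loglinear_presmooth(observed, degree=4), 1))
```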

The presmoothing values used in the current study can be found in Table A18. For Spanish, slightly different smoothing values were used for the Matched Samples and Single Population pseudo-groups.

The empirical and smoothed distributions for the New and Old Form pseudo-groups can be found in Figures A10 to A21. Also in those figures are the fitted form distributions for the UIRT and MIRT models, which will be discussed in greater detail next. For the Matched Samples Chemistry datasets, the smoothed distributions for the Old and New Forms had very different shapes. This was expected given that the Matched Samples Old and New Form datasets for Chemistry had quite different observed total score distributions. Overall, the presmoothed distributions appear to fit the empirical distributions very closely for the Old and New Forms, across all subjects.

UIRT and MIRT Methods. The fitted distributions for the New and Old Forms for the Matched Samples and Single Population datasets can be found in Figures A10 to A21. For the UIRT, Bifactor, and MIRT conditions, additional labels of (I) and (II) accompany each model, indicating whether prior Set I or II was used. For all subjects, the log-linear smoothed distribution used with the traditional equipercentile method appeared to fit the empirical distributions better than or equally as well as some of the (M)IRT models, but also appeared less normal.

For Chemistry, and for both the Matched Samples and Single Population datasets, the fitted distributions for the BF-Content method were very bumpy, which was unexpected. The fitted distributions for the UIRT methods and BF-Format methods were similar to one another and seemed to approximate the observed distributions well for both datasets. For the Matched Samples datasets, there were differences within the full MIRT 2D methods such that the 2D orthogonal methods were different from the 2D correlated methods and were instead more similar to the 4D correlated and orthogonal methods. The fitted distributions from the 2D correlated method appeared to fit better than both 4D methods and the 2D orthogonal methods. For the Single Population datasets, the full

MIRT 2D methods looked similar to one another, as did the 4D methods, and generally didn't fit as well as the log-linear, UIRT, or BF methods. For the Spanish Matched Samples and Single Population datasets, the log-linear method appeared to fit the observed distributions most closely, but the UIRT and the majority of the MIRT methods also appeared to fit well. The fitted distributions from the BF methods were very similar to one another, the full MIRT 2D fitted distributions were similar to each other, and the full MIRT 4D fitted distributions were similar to one another. One exception was seen with the full MIRT 2D and 4D orthogonal methods when prior Set (I) was used. For these conditions, the fitted distributions for the New and Old Forms were below those of the other (M)IRT models. In general, all methods appeared to fit the observed distributions reasonably well except for the full MIRT 2D and 4D orthogonal models with prior settings (I).

For the English Matched Samples and Single Population datasets, the fitted distributions for the log-linear, UIRT, BF, and full MIRT methods seem very similar to one another and each provides a close approximation to the empirical distribution. For the Matched Samples datasets, the total score distribution of the Old Form examinee group is noticeably more peaked in comparison to the New Form group, which may be an artifact of the sampling method.

Equating Relationships for Single Population Datasets

In the Single Population datasets, the Old and New Form groups were sampled from the Main Form examinees. The effect of this approach was that the New and Old Form pseudo-groups took the same exam (i.e., the operational Main Form), thereby negating the need to equate the pseudo New Form group to the pseudo Old Form scale. Therefore, there are no differences in difficulty across forms, and theoretically a score on the pseudo New Form should be equivalent to the same score on the pseudo Old Form. This theoretical relationship is known as Identity, where the Old Form equivalents are equal to the New Form scores. When differences between the Old Form

equivalent and New Form scores are displayed graphically, the Identity equating line appears as a zero-difference line across all values of the New Form score scale. For the Single Population datasets, any departures from this zero-difference line can be interpreted as error, both random and systematic. The greater the departure from the theoretical reference line of zero difference, the greater the error associated with the studied equating method.

The unidimensional and multidimensional IRT observed score equating procedures were carried out in RAGE-RGEQUATE by using the fitted marginal ability distributions from flexMIRT. Equating results were evaluated with regard to graphical displays of raw and scale score equating relationships, DTM plots between the (M)IRT methods and the criterion Identity method, and weighted RMSDs between the (M)IRT methods and the criterion method. Results are broken down by subject area, where Chemistry is presented first, followed by Spanish, and then English. It should be noted that even though the raw composite score scale ranged from 0 to 135, 145, or 156 for English, Spanish, and Chemistry, respectively, results are only provided for raw composite scores that had frequencies of 10 or more examinees. This was done to avoid making interpretations at points along the score scale where there was too little data to be considered reliable. The smoothed traditional equipercentile equating method for each subject was also included to judge its appropriateness as the criterion for the Matched Samples datasets. If the equipercentile relationship showed a large departure from the Identity criterion, this would be taken as evidence against its use as a criterion in the Matched Samples datasets.

Chemistry. Figures A22 to A25 depict the equating relationships for raw scores for Chemistry using the equipercentile and UIRT, Bifactor (BF), full MIRT 2D, and full MIRT 4D equating methods. In these figures, results using both prior settings for the item parameters (referred to as (I) or (II)) are displayed simultaneously and show only minor

differences. Therefore, results using just prior setting (II) for the (M)IRT methods plus the results for the traditional equipercentile method are provided in Figure A26. The two reference lines at ± 0.5 indicate the DTM for Old Form equivalents. The equating relationships for the UIRT and BF-Format methods were most similar to the Identity criterion, while the traditional equipercentile method was most different. Surprisingly, the relationship pattern for the BF-Content method was very different from the other methods, but was still within the DTM boundaries across the majority of the score scale. The pattern for the equating relationships was similar across (M)IRT methods, with the exception of the BF-Content method.

The equating relationships for scale scores can be found in Figures A27 to A30 for Chemistry using the equipercentile and UIRT, Bifactor, full MIRT 2D, and full MIRT 4D equating methods. The general patterns of the equating relationships were the same as those found for the raw score equating relationships, except that differences were less pronounced due to the less precise score scale. All methods except for the traditional equipercentile method fall within the DTM boundaries. Figure A31 provides a more direct visual comparison across all equating methods using only prior setting (II) results.

Spanish Language. Figures A32 to A35 depict the equating relationships for raw scores for Spanish using the equipercentile and UIRT, Bifactor, full MIRT 2D, and full MIRT 4D equating methods. In these figures, results using both prior settings for the item parameters (referred to as (I) or (II)) are displayed simultaneously. The prior settings seemed to affect the raw score equating relationships for the UIRT, Correlated 2D, and Correlated 4D methods. Even though prior settings seemed to make a difference with some methods, results using just prior setting (I) for the (M)IRT methods plus the results for the traditional equipercentile method are provided in Figure A36 to provide a more direct comparison across methods. Again, the two reference lines at ± 0.5 indicate the DTM standard for Old Form equivalents. The equating relationship for the full MIRT orthogonal 4D method was most similar to the Identity criterion, and fell within the DTM

boundaries for raw and scale scores across the entire score scale. The methods most different from the Identity criterion were the traditional equipercentile and UIRT methods. The equating relationship patterns for the remaining (M)IRT methods were very similar to one another but more curvilinear in comparison to the criterion.

The equating relationships for scale scores can be found in Figures A37 to A40 for Spanish for the equipercentile and UIRT, Bifactor, full MIRT 2D, and full MIRT 4D equating methods. The general patterns of the equating relationships are very similar across all methods. For scale scores, the full MIRT orthogonal 4D method came closest to the Identity relationship and was within the DTM boundaries across the entire score scale. The scale score equating relationships for the remaining methods were within the DTM boundaries for raw composite scores of approximately 70 or greater. Figure A41 provides a more direct visual comparison across all equating methods and provides results from when prior setting (I) was used with the (M)IRT calibrations.

English Language and Composition. Figures A42 to A44 depict the equating relationships for raw scores for English using the equipercentile and UIRT, Bifactor, and full MIRT 2D equating methods. In these figures, results using both prior settings for the item parameters (referred to as (I) or (II)) are displayed simultaneously. Across all (M)IRT methods, the use of different prior settings did not seem to affect the estimated equating relationships for raw scores. Estimated equating relationships resulting from the use of prior setting (I) for the (M)IRT methods plus the results for the traditional equipercentile method are provided in Figure A45 to provide a more direct comparison. Again, the two reference lines at ± 0.5 indicate the DTM for Old Form equivalents. The equating relationships for the Bifactor and full MIRT methods were most similar to the Identity criterion, but fell outside the DTM boundaries for the lower half of the score scale. The traditional equipercentile method was most different from the Identity criterion, but the discrepancy was not as large as it was with Chemistry and Spanish.

Overall, the equating relationship patterns for the (M)IRT methods were very similar to one another and not that different from the traditional equipercentile pattern.

The equating relationships for English scale scores can be found in Figures A46 to A48 for the equipercentile and UIRT, Bifactor, and full MIRT 2D methods. The general patterns of the equating relationships are very similar across all methods, with the exception of the traditional equipercentile method at the upper end of the score scale. The scale score equating relationships for all (M)IRT methods were within the DTM boundaries, with the exception of the full MIRT 2D methods, which fell outside the DTM boundaries for raw composite scores greater than 95. The equating relationship for the traditional equipercentile method fell outside the DTM boundaries for raw composite scores between 25 and 40 and between 112 and 115. Figure A49 provides a more direct visual comparison across all equating methods and provides results from when prior setting (I) was used.

Equating Relationships for Matched Sample Datasets

The Matched Sample datasets were created by sampling examinees from the original Old and New Forms using ethnicity as a stratification variable. Pseudo Old and New Form groups were created using an iterative sampling procedure that continued until the standardized mean difference in common item scores between the two pseudo-groups was less than ± 0.05 raw score points. The equating criterion used with the Matched Sample datasets was the traditional equipercentile method with presmoothing.

The unidimensional and multidimensional IRT observed score equating procedures were carried out in RAGE-RGEQUATE by using the fitted marginal ability distributions provided by flexMIRT. Equating results were evaluated with regard to graphical displays of raw and scale score equating relationships, difference plots between the (M)IRT method and the criterion method, and weighted RMSDs between the (M)IRT method and the criterion method. Results are broken down by subject, so that Chemistry is presented first, followed by Spanish, and then English.
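The stopping rule for the iterative matched sampling described above rests on a standardized mean difference (effect size) computed on the common-item scores of the two pseudo-groups. A minimal sketch of that check is given below; the pooled-standard-deviation form of the effect size, the function names, and the simulated data are assumptions made for illustration, since the exact formula used in the study is not reproduced here.

```python
import numpy as np

def common_item_effect_size(old_group_scores, new_group_scores):
    """Standardized mean difference on common-item scores between the
    pseudo Old Form and pseudo New Form groups (pooled-SD version)."""
    old = np.asarray(old_group_scores, dtype=float)
    new = np.asarray(new_group_scores, dtype=float)
    pooled_sd = np.sqrt((old.var(ddof=1) + new.var(ddof=1)) / 2.0)
    return (new.mean() - old.mean()) / pooled_sd

def groups_are_matched(old_group_scores, new_group_scores, tolerance=0.05):
    """The ± 0.05 criterion used when forming the pseudo-groups."""
    return abs(common_item_effect_size(old_group_scores, new_group_scores)) < tolerance

# Hypothetical common-item score vectors for two pseudo-groups of 3,000.
rng = np.random.default_rng(1)
old_scores = rng.binomial(25, 0.550, size=3000)
new_scores = rng.binomial(25, 0.552, size=3000)
print(common_item_effect_size(old_scores, new_scores))
print(groups_are_matched(old_scores, new_scores))
```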

Chemistry. The estimated raw score equating relationships for Chemistry using the UIRT, Bifactor, full MIRT 2D, and full MIRT 4D methods can be found in Figures A50 to A53. These figures represent the equating relationships only and, unlike those for the Single Population datasets, do not allow for comparisons with the DTM standard. The equating relationships for the BF-Content methods were slightly bumpy, but did not differ much from the BF-Format methods. Across all (M)IRT methods, the choice of item prior did not seem to make a large difference in the estimated equating relationships. The equating relationships across all methods can be seen in Figure A54, where the (M)IRT results reflect those obtained using prior setting (II). In general, the MIRT methods appeared to have the same equating patterns, whereas the patterns for the UIRT and traditional equipercentile methods were very different from the MIRT methods and from one another.

Figures A55 to A58 display the DTM difference plots for raw scores for Chemistry, where the y-axis represents the difference between the (M)IRT Old Form equivalent and the smoothed traditional equipercentile Old Form equivalent. The area outside of the two solid reference lines corresponds to differences that are greater than ± 0.5 raw score points, the DTM standard. In general, the MIRT methods all pass through the inside of the DTM boundaries but do so only near raw scores of 30, 90, and 135, and are not consistently within the boundaries at any score ranges. The UIRT method is most different from the equipercentile criterion and always falls below the DTM boundaries. The DTM comparisons for all methods (using prior setting II) can be found in Figure A59. Among the MIRT methods, there are two groupings of equating relationships: the first group includes both Bifactor methods and the full MIRT correlated 2D method, and the second group includes both full MIRT 4D methods and the full MIRT 2D orthogonal method.

Figures A60 to A63 depict differences for scale scores, where the y-axis now represents the difference between the (M)IRT Old Form scale score equivalents and the traditional equipercentile scale score equivalents, and the x-axis represents the New Form

raw composite score. The general pattern for the differences in scale score equivalents is similar to the differences in raw score equivalents, except that the MIRT methods fall within the DTM boundaries for larger portions of the score scale. Figure A64 displays the equating relationships for all studied methods using prior setting (II). From this figure, it is obvious that the UIRT method is still outside of the DTM boundaries along the entire score scale, except near the upper region, and is most different from all other methods.

Spanish Language. The estimated raw score equating relationships for Spanish can be found in Figures A65 to A68 for the various (M)IRT methods and in Figure A69 for all methods, but using prior setting (I) only. The general pattern for the estimated equating relationships was similar across all (M)IRT methods and the traditional equipercentile method. However, the relationships for the full MIRT Orthogonal 4D, Orthogonal 2D, and Correlated 4D methods were noticeably more similar to one another. Furthermore, the relationships for the full MIRT Correlated 2D, BF-Content, BF-Format, UIRT, and traditional equipercentile methods were most similar to one another. The choice of prior settings appeared to affect the estimated equating relationships for the MIRT methods (and to a lesser extent for the BF methods), but not for the UIRT method.

Differences between the Old Form raw score equivalents from the traditional equipercentile method and the (M)IRT methods can be found in Figures A70 to A73 for the UIRT, Bifactor, and full MIRT methods, respectively. The same differences can be found for all methods, using only prior setting (I), in Figure A74. From Figure A74, it can be seen that the differences between the equipercentile and the UIRT, BF-Content, BF-Format, and full MIRT Correlated 2D methods are generally less than the DTM standard of ± 0.5 points. In fact, the UIRT method was the only equating procedure that remained inside the DTM boundaries across the entire scale. However, the differences between the equipercentile method and the full MIRT 4D methods and full MIRT 2D orthogonal method are greater than the DTM standard across the entire score scale.
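Throughout these comparisons, a difference between a studied method's Old Form equivalent and the criterion equivalent is judged against the DTM standard of ± 0.5 raw score points. A minimal sketch of that check, with hypothetical names and toy values, is given below.

```python
import numpy as np

DTM = 0.5  # difference that matters, in raw score points

def dtm_flags(method_equivalents, criterion_equivalents):
    """Signed differences at each New Form score point and a flag indicating
    whether each difference exceeds the DTM standard."""
    diffs = np.asarray(method_equivalents) - np.asarray(criterion_equivalents)
    return diffs, np.abs(diffs) > DTM

# Toy example at five New Form score points.
method_eq    = np.array([10.2, 20.8, 31.1, 40.4, 50.9])
criterion_eq = np.array([10.0, 20.0, 31.0, 40.0, 50.0])
differences, exceeds_dtm = dtm_flags(method_eq, criterion_eq)
print(differences)   # approximately [0.2, 0.8, 0.1, 0.4, 0.9]
print(exceeds_dtm)   # [False, True, False, False, True]
```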

Differences between the Old Form scale score equivalents from the traditional equipercentile method and the (M)IRT methods can be found in Figures A75 to A78 for the UIRT, Bifactor, and full MIRT methods, respectively. The same differences can be found for all methods, using only prior setting (I), in Figure A79. The same trend as was found using the DTM standard for raw scores was also found for scale score equivalents, with a few exceptions. First, the full MIRT correlated 2D method was no longer entirely within the DTM boundaries, particularly at the lower and upper score ranges. Second, the general pattern for the full MIRT 2D correlated method mirrored the patterns of the other full MIRT methods. This was also true for the raw score differences, but was less pronounced, and thus harder to identify. Third, the full MIRT 4D methods and the orthogonal 2D method are within the DTM boundaries for New Form scale scores near the middle of the scale, whereas differences were greater than the DTM standard for raw scores along the entire score scale.

English Language and Composition. The estimated raw score equating relationships for English can be found in Figures A80 to A82 for the various (M)IRT methods and in Figure A83 for all methods, but using prior setting (I) only. The general pattern for the estimated equating relationships was similar across all (M)IRT methods and the traditional equipercentile method. It is important to note that four-dimensional MIRT methods were not included for English because the test blueprints indicated two content subdomains and two item formats. The choice of item prior settings did not seem to influence the UIRT, Bifactor, or full MIRT 2D raw score equating relationships.

Differences between the traditional equipercentile method and the (M)IRT methods can be found in Figures A84 to A86 for the UIRT, Bifactor, and full MIRT 2D methods, respectively. In general, the BF-Format method performed most differently from the traditional equipercentile method and had differences greater than the DTM standard for Old Form raw scores between 55 and 115. Differences between the traditional equipercentile method and all MIRT methods were greater than the DTM standard for

New Form raw scores near 25, but then were generally within the DTM boundaries for the remainder of the score scale, with the exception of the BF-Format method (Figure A87).

Differences between the scale score equivalents from the traditional equipercentile method and the UIRT, Bifactor, and full MIRT methods can be found in Figures A88 to A90, respectively. The same differences can be found for all methods, using only prior setting (I), in Figure A91. The same general trend was observed for scale score equivalents as was found for raw scores, except that the differences were less pronounced due to the less precise score scale. For example, the differences for the BF-Format method were generally less than the DTM standard. One difference between the raw and scale score difference plots is that the UIRT method produced scale score differences that were greater than the DTM standard at the lower end of the score scale, whereas all differences in raw score equivalents were within the DTM boundaries.

Weighted Root Mean Square Differences for Old Form Equivalents

In addition to graphical comparisons, a more quantifiable measure of the differences between the traditional equipercentile, UIRT, Bifactor, and full MIRT observed score equating methods and the respective criterion method was obtained by computing a weighted Root Mean Square Difference (wRMSD) statistic for Old Form raw and scale score equivalents. For the Single Population datasets, the Identity equivalent served as the criterion, whereas for the Matched Samples datasets, the smoothed traditional equipercentile equivalent served as the criterion. Differences were weighted according to the frequency distributions of the sampled New Form pseudo-groups. Smaller wRMSD values indicate greater similarity between the (M)IRT method and the respective criterion method.
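Consistent with this description, the wRMSD weights the squared difference between a studied method's Old Form equivalent and the criterion equivalent at each New Form raw score by the relative frequency of that score in the sampled New Form pseudo-group, and then takes the square root of the weighted sum. The sketch below is one plausible implementation of that definition, not the study's own code; the names and toy values are hypothetical.

```python
import numpy as np

def weighted_rmsd(method_equivalents, criterion_equivalents, new_form_freqs):
    """Weighted root mean square difference between Old Form equivalents from
    a studied method and from the criterion, weighted by the New Form raw
    score frequency distribution."""
    diffs = np.asarray(method_equivalents) - np.asarray(criterion_equivalents)
    weights = np.asarray(new_form_freqs, dtype=float)
    weights = weights / weights.sum()  # relative frequencies
    return float(np.sqrt(np.sum(weights * diffs ** 2)))

# Toy example over five New Form score points.
method_eq    = np.array([10.2, 20.8, 31.1, 40.4, 50.9])
criterion_eq = np.array([10.0, 20.0, 31.0, 40.0, 50.0])
frequencies  = np.array([  50,  300,  600,  300,   50])
print(round(weighted_rmsd(method_eq, criterion_eq, frequencies), 3))
```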

Single Population Datasets. For the Single Population datasets, the wRMSDs for raw and scale scores for Chemistry, Spanish, and English can be found in Tables A19 and A20, respectively. The smallest wRMSD values for Chemistry raw and scale score equivalents were found for the UIRT and BF-Format methods, whereas the largest wRMSD was found for the traditional equipercentile method. This finding agrees with the difference plots presented earlier, where the UIRT and BF-Format methods were within the DTM boundaries across the entire score scale. In general, the difference plots and the wRMSDs led to consistent conclusions for each subject. The smallest wRMSD for Spanish raw and scale score equivalents was found for the full MIRT 4D orthogonal method, and the largest corresponded to the traditional equipercentile method, followed closely by the UIRT and both BF methods. The smallest wRMSDs for English raw scores were found for the full MIRT 2D methods, whereas the smallest wRMSDs for English scale scores were found for the Bifactor and full MIRT methods and were very similar in magnitude. For English, the largest wRMSDs for raw and scale scores were found for the UIRT method.

Differences related to prior distribution settings were more apparent in the raw score wRMSDs for all subjects. For Chemistry, the greatest differences between prior distribution settings were seen for the BF-Content and full MIRT 4D correlated methods. For Spanish, the greatest differences resulting from the use of different prior settings were found for the BF-Format and full MIRT 2D correlated methods. For English, the greatest difference was found for the BF-Content method. Across all subject areas, when the differences between prior settings were among the largest, the use of prior setting (I) led to smaller wRMSD values, with the exception of the BF-Content method for Chemistry.

Matched Samples Datasets. For the Matched Samples datasets, the wRMSDs for raw and scale scores for Chemistry, Spanish, and English can be found in Tables A21 and A22, respectively. The smallest wRMSD values for Chemistry raw and scale score equivalents were found for the BF-Format, BF-Content, and full MIRT 2D correlated methods, whereas the largest wRMSD was associated with the UIRT method. For Spanish raw score equivalents, the smallest wRMSD was found for the UIRT and full

MIRT 2D correlated method, and the largest corresponded to the full MIRT 4D orthogonal method. However, for the Spanish scale score equating relationships, the smallest wRMSDs corresponded to the UIRT, BF-Format, and BF-Content methods, and the largest still corresponded to the full MIRT 4D orthogonal method. For English raw and scale score equivalents, the smallest wRMSDs were found for the UIRT and BF-Content methods, and the largest was found for the BF-Format method.

For the Matched Samples datasets, and similar to what was found with the Single Population datasets, differences related to prior distribution settings were more apparent for raw score wRMSDs across all subjects. For the English datasets, the use of different prior settings didn't seem to result in large wRMSD discrepancies. However, for Spanish, the discrepancies between wRMSDs for raw score equivalents were rather large for the BF-Format, BF-Content, and full MIRT 4D correlated methods. These discrepancies also carried over to the wRMSDs for Spanish scale score equivalents. For Chemistry, the greatest differences in raw score equivalents related to prior distribution settings were seen for the full MIRT correlated 2D and 4D methods; however, these differences did not carry over to scale score equivalents.

AP Grade Agreements

Since the data used in this dissertation were originally collected with the intent to classify students into one of five grade levels through the use of cut scores, it was important to look at classification agreements among the traditional equipercentile, UIRT, Bifactor, and full MIRT observed score equating methods. For the Single Population datasets, agreements between the (M)IRT and traditional equipercentile methods and the Identity equating method were computed, whereas for the Matched Samples datasets, agreements were computed between the (M)IRT methods and the traditional equipercentile method (which served as the criterion). For each studied equating method, the overall exact percent agreement was computed as the percentage of New Form raw scores that equated to the same Old Form AP Grade equivalent as was

found for the respective criterion method. The percentage was unweighted and taken across the entire score scale, which for Chemistry spanned 157 raw score points, for Spanish 146 score points, and for English 136 score points. It is important to note that AP Grade equivalents are found by transforming the Old Form raw score equivalents to AP Grades using the Old Form Raw-to-Grade conversion table. As a result, scale scores are not required to assign students to AP Grade levels.

Single Population Datasets. For the Single Population datasets, AP Grade agreement percentages were generally high, ranging between 97.1% and 100%, and can be found in Table A23. For all subjects, the traditional equipercentile method resulted in the lowest (albeit still relatively high) AP Grade agreement percentages. For Chemistry, the UIRT, Bifactor, and full MIRT 2D methods all resulted in perfect agreement with the Identity criterion. For Spanish, perfect agreement was obtained for the full MIRT 4D orthogonal method, and in general the full MIRT methods had higher agreements in comparison to the UIRT and Bifactor methods. For English, the UIRT and MIRT methods resulted in the same agreement percentages (i.e., 99.1%), whereas the agreement percentage for the traditional equipercentile method was slightly lower (i.e., 98.8%).

Matched Samples Datasets. For the Matched Samples datasets, AP Grade agreement percentages were more variable than for the Single Population datasets and ranged between 90.8% and 100% (Table A24). It should be noted that agreements were computed such that the smoothed traditional equipercentile method served as the criterion. For Chemistry, the Bifactor methods, along with the full MIRT correlated 2D and 4D methods (when prior setting II was used), generally had the highest agreement percentages. For Spanish, perfect agreement was obtained for the full MIRT 2D correlated (using prior setting II) and BF-Content (using prior setting I) methods. For English, the UIRT, BF-Content, and full MIRT 2D correlated (using prior setting II) methods resulted in perfect agreement with the traditional equipercentile criterion.
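The exact agreement computation described above can be sketched as follows: each New Form raw score point is mapped to an Old Form equivalent, converted to an AP Grade with the Old Form raw-to-grade cut scores, and compared with the grade implied by the criterion method's equivalent. The cut scores, names, and toy values below are hypothetical (the actual AP conversion tables are not reproduced here).

```python
import numpy as np

def to_grade(old_form_equivalents, grade_cuts):
    """Convert Old Form raw score equivalents to AP Grades 1-5, where
    `grade_cuts[k]` is the minimum raw score needed for grade k + 2."""
    cuts = np.asarray(grade_cuts)
    return 1 + np.searchsorted(cuts, np.asarray(old_form_equivalents), side="right")

def exact_agreement_pct(method_equivalents, criterion_equivalents, grade_cuts):
    """Unweighted percentage of New Form score points assigned the same
    AP Grade by the studied method and by the criterion method."""
    same = to_grade(method_equivalents, grade_cuts) == to_grade(criterion_equivalents, grade_cuts)
    return 100.0 * same.mean()

# Hypothetical cut scores on a 0-60 Old Form raw score scale.
grade_cuts   = [20, 30, 40, 50]  # minimum raw score for grades 2 through 5
method_eq    = np.array([18.2, 29.6, 30.4, 41.0, 52.3])
criterion_eq = np.array([18.0, 30.1, 30.0, 40.8, 51.9])
print(exact_agreement_pct(method_eq, criterion_eq, grade_cuts), "% exact agreement")
```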

Summary

In this chapter, equating results for the UIRT, Bifactor, full MIRT, and traditional equipercentile methods were presented for Chemistry, Spanish Language, and English. For each subject, two equating criteria were used in an attempt to cross-validate results since true equating relationships were unknown. The theoretical Identity relationship served as the first criterion, and was used with the Single Population datasets. The traditional equipercentile method with log-linear presmoothing served as the second criterion, and was used with the Matched Sample datasets. Results were evaluated with respect to raw and scale score equating relationships, DTM standards for raw and scale scores, weighted Root Mean Square Differences between the equating criteria and the studied methods for raw and scale scores, and AP Grade agreement percentages.

Within each subject, equating results varied depending on which equating criterion was used. For the Single Population datasets, the methods that performed the closest to the Identity criterion were the UIRT and BF-Format methods for Chemistry, the full MIRT orthogonal 4D method for Spanish, and both full MIRT 2D methods for English. The traditional equipercentile method performed most differently from the Identity criterion for Chemistry and Spanish, and was among the most different for English. For the Matched Sample datasets, the methods that performed the closest to the traditional equipercentile criterion were the BF-Content, BF-Format, and full MIRT correlated 2D methods for Chemistry, the UIRT and full MIRT correlated 2D methods for Spanish Language, and the UIRT and BF-Content methods for English.

CHAPTER V
DISCUSSION

The purpose of this dissertation was to build upon the existing MIRT equating literature by introducing a full MIRT observed score equating method for mixed-format tests and compare its performance with the traditional equipercentile method with log-linear presmoothing, UIRT observed score, and Bifactor observed score equating procedures. In the past few years, an increasing interest in MIRT has emerged in the literature due to evidence suggesting that the interaction of examinees with test items is likely governed by more than one latent ability. Therefore, it was important to investigate the costs and benefits associated with using a more complex equating method in comparison to methods that are more parsimonious. Specifically, the research objectives were to:

1. Present observed score equating methods for mixed-format tests using a full MIRT modeling framework.
2. Compare the differences in equating results from using full MIRT, Bifactor, UIRT, and traditional equipercentile methods.
3. Compare differences in equating results for the full MIRT method when the correlations between latent traits were specified as mutually orthogonal versus when they were freely estimated.
4. Examine whether dimensionality associated with item format versus content subdomain influenced the relationships among the four equating methods in a similar manner.
5. Determine how the severity of the violation of unidimensionality, as measured by a combination of dimensionality assessment procedures, affects equating results.

In addition, the effect of using different item prior distribution settings during item calibration was also investigated, but was not a focus of this dissertation. The use of

item priors becomes more relevant and more necessary as the number of estimated parameters increases, such as with full MIRT models.

In this chapter, a discussion surrounding the results from Chapter Four is presented and begins with a focus on the importance of selecting an appropriate equating criterion when research involves real data. For any comparative study, there is a need for a criterion or baseline to which studied methods can be compared. The importance of this criterion cannot be overstated; it affects all subsequent interpretations, which is why it is discussed first. Next, a discussion concerning the dimensionality of the AP Exams is presented, and is followed by a detailed comparison of the studied equating methods. Following the general comparisons, specific discussion corresponding to Research Objectives two through five is presented. A discussion of the limitations embedded in the current dissertation, as well as future directions, is presented last.

Importance of the Equating Criterion

Since all analyses in this dissertation involved real data, the comparisons between the studied equating methods were made with respect to an equating criterion. When real data are used, the true equating relationships can never be known, and therefore plausible substitutions for the truth must be made. For this dissertation, substitutions took the form of theoretically expected relationships and those that were considered unbiased with respect to dimensionality assumptions. For the Single Population datasets, the theoretical Identity equating relationship served as the criterion, whereas for the Matched Samples datasets, the presmoothed traditional equipercentile relationship served as the criterion. The purpose of adding the Identity equating criterion was threefold: (1) it allowed for findings to be cross-validated using separate criteria, (2) it provided a way to evaluate the performance of the traditional equipercentile method as a criterion, and (3) the manner in which the Single Population datasets were formed (random sampling) provided a way to evaluate how well the pseudo-groups in the Matched Samples datasets were sampled using a matching variable.

Across all AP subjects, the conclusions made when the Identity relationship served as the criterion versus when the presmoothed traditional equipercentile relationship served as the criterion were completely different from one another. The comparisons made with the Identity criterion agreed with the overall findings from the dimensionality assessments. For example, the dimensionality assessments suggested that the AP Chemistry forms were likely unidimensional, and it turned out that the UIRT and Bifactor methods performed most similarly to the Identity criterion in comparison to the full MIRT methods. Furthermore, the Spanish and English forms showed evidence of being more multidimensional, and among the studied equating relationships, the UIRT method performed most differently from the Identity criterion for these subjects. The congruence between the dimensionality assessments and the equating results obtained by using the Identity criterion provides support for the use of the Identity criterion. In contrast, comparisons made with the traditional equipercentile criterion agreed with neither the dimensionality assessments nor the equating results obtained by using the Identity criterion. This was most likely because, in the Single Population datasets, the traditional equipercentile equating relationships were typically most different from the Identity criterion. Since the Identity criterion was used in a situation where equating was not necessary, any departures from it could be attributable to random error and/or systematic error. Some departures from the Identity relationship were expected due to random sampling error, but large departures, such as those seen with the traditional equipercentile method, may indicate that it contained more error.

It is important to point out that, for the Single Population datasets, even though both groups were sampled from the same population, there was evidence of slight group differences. Differences were encountered due to the use of samples as opposed to populations, and therefore it was expected that the Identity relationship was not a perfect criterion. However, this does not imply a problem with the Identity criterion, but instead can better be seen as a sampling issue. As a result, small departures from the Identity

equating line may not be indicative of random or systematic error but, instead, may represent actual group differences. Therefore, it cannot be said with certainty that departures from the Identity equating line reflect error. Even though the traditional equipercentile method was not a perfect substitute for the true equating relationship, it was the best available option for the Matched Samples datasets and was useful for making comparative evaluations. With this said, making statements in regard to the accuracy or superiority of one equating method over another was generally avoided; this may be an applicable guideline for other studies involving real data.

Since the traditional equipercentile method performed much differently from the Identity criterion, it was difficult to make general comparisons among the studied methods using both criteria simultaneously. For this reason, and because of the congruence between the dimensionality assessments and the equating results using the Identity criterion, more emphasis is placed on results obtained using the Single Population datasets and the Identity criterion. However, both sets of results are considered during the discussion of research objectives two through four, since the traditional equipercentile criterion was still considered useful for making comparisons among the studied equating methods.

Effect of Violating the IRT Unidimensionality Assumption
Research Objective #5

Several different procedures were used to assess the dimensionality of the AP Matched Samples datasets in order to determine whether the UIRT assumption of unidimensionality held. These procedures included principal component analyses (PCAs) on polychoric correlation matrices, categorical exploratory factor analyses (EFAs), the computation of disattenuated correlations between item formats and between content subdomains, and use of the Poly-DIMTEST test statistic to test for essential unidimensionality. It was expected that findings using the Matched Samples would be

similar to those using the Single Population datasets due to the overlap in items and similar sample sizes. Therefore, dimensionality assessments were conducted using the Matched Samples datasets only, but results were treated as applicable to both datasets. Some discrepancies among the dimensionality assessment techniques emerged, but in general, it appeared as though Chemistry was largely unidimensional whereas Spanish and English showed more evidence of being multidimensional. For the Chemistry Old and New Forms, the first eigenvalue accounted for more than 20% of the total variance, the disattenuated correlations were all relatively close to one, and the majority of the test statistics from Poly-DIMTEST were non-significant (indicating that essential unidimensionality wasn't violated). For the Spanish and English forms, the first eigenvalue accounted for less than 20% of the total variance, the majority of the disattenuated correlations were in the mid-0.80s, and the results from Poly-DIMTEST were mixed (although the majority were non-significant).

At this point, it is worth mentioning that results from Poly-DIMTEST were highly variable depending on how the items were parsed. With such variable results, the question becomes which results to rely on, and that question does not have a definitive answer. In this dissertation, that decision was avoided by simultaneously considering all results, but it is acknowledged that this approach may not be practical in operational settings. Due to the variable nature of the Poly-DIMTEST test statistic, more emphasis was placed on the results from the other dimensionality assessments when determining the dimensional structure of the AP forms.

In order to have a more straightforward discussion of the effect of violating the UIRT unidimensionality assumption on the estimation of equating relationships, only the Single Population datasets are referenced here. This decision was based on two findings: (a) the traditional equipercentile method often performed most differently from the Identity relationship, and (b) the equating relationships from the Single Population datasets are in agreement with findings from the dimensionality assessments. This does

not exclude all findings associated with the Matched Sample datasets, as they will be included in the sections that follow.

For Chemistry, three distinct patterns emerged in the estimated equating relationships for raw and scale scores. For example, the UIRT, BF-Format, and full MIRT methods led to similar raw and scale score equating relationship patterns, whereas the relationships for the BF-Content and traditional equipercentile methods had quite different patterns. The UIRT, BF-Format, and two-dimensional full MIRT correlated methods performed the closest to the Identity criterion. The fact that these models are either unidimensional or less complex multidimensional models, and that they agreed more closely with the criterion relationship, is consistent with the finding that Chemistry is more unidimensional. However, all methods except for the four-dimensional full MIRT and traditional equipercentile methods resulted in perfect AP Grade agreement percentages, which suggests the importance of the score scale. It may be the case with unidimensional exams that even when the fitted model overestimates the number of dimensions, the equating results may be very comparable to those found by fitting a unidimensional model.

In some respects, unidimensional models can be thought of as special cases of multidimensional models. Therefore, for unidimensional data, there may be little difference between UIRT and MIRT equating methods. For example, if a multidimensional model were fit to unidimensional data, the additional discrimination parameters would be freely estimated and would likely be small. However, this sort of secondary safety net does not exist in situations where unidimensional models are fit to multidimensional data because there are no extra parameters in the model that can pick up on additional latent dimensions that exist beyond the first dimension. Therefore, even when weak multidimensionality is suspected, it may be beneficial to overestimate the dimensional structure rather than underestimate it, assuming there is not a significant increase in estimation error. The potential benefit of slightly overestimating the number

of latent dimensions can be nullified if, in doing so, a large amount of estimation error is introduced. For tests like Spanish and English, where the unidimensionality assumption was likely violated, full MIRT observed score equating methods may provide an acceptable alternative to UIRT methods. For Spanish, the four-dimensional full MIRT model with orthogonal traits performed most similarly to the Identity criterion, whereas the two-dimensional full MIRT model with trait correlations freely estimated performed the closest to the criterion for English. Based on test blueprints, the highest dimensional structure expected for the Spanish forms was a four-factor solution and for the English forms was a two-factor solution, which agrees with the equating results. However, the PCA results for both forms of English and Spanish indicate that a two-factor structure is more likely, which doesn't agree with the equating result for Spanish. This finding may suggest that results from PCAs may not be suitable for mixed-format exams, even if conducted on polychoric correlations.

For English, the UIRT and traditional equipercentile methods performed most differently from the criterion and had the largest wRMSDs for raw and scale scores; however, there was very little difference in AP Grade agreements among the studied methods. For Spanish, the raw score equating relationship from the UIRT method performed quite differently from the criterion method, but still resulted in a high AP Grade agreement percentage. Therefore, the consequences of possibly underestimating the complexities of the data matrices depend on the score scale being used to make decisions. If the score scale is less precise, as is the case with the AP Grade scale, then the choice between UIRT and MIRT equating methods may have fewer consequences than if raw scores (which are more precise) were used for reporting purposes.

General Comparison of Equating Methods
Research Objective #2

A basic and all-encompassing objective of this dissertation was to compare the full MIRT observed score equating methods to the traditional equipercentile, UIRT observed score, and Bifactor observed score methods. Comparisons in this section are made with respect to the estimated equating relationships for raw scores, unless stated otherwise. For the Single Population datasets, equating relationships also represent differences between the studied method and the Identity criterion, and therefore equating relationships and DTM comparisons are discussed together for these datasets. For the Matched Sample datasets, comparisons are made with respect to only the estimated equating relationships for raw scores. In order to avoid redundancy, estimated equating relationships for scale scores are not included because they generally followed the same patterns as the raw score relationships. Since conclusions depend on the equating criterion (and dataset), comparisons made with the Single Population datasets are discussed first, and are followed by those made with the Matched Samples datasets.

Single Population Datasets

The estimated equating relationships obtained with the Single Population datasets were expected to behave as if there were no form or group differences. Slight deviations were expected due to random sampling error, but in general, the equating relationships should have appeared similar to the Identity relationship. For the most part, the patterns of the estimated equating relationships appeared to take on the same shape as the Identity relationship, but some were bumpier than expected. Differences in equating relationships between the studied methods and the Identity criterion were considered large if they were greater than the DTM standard of ± 0.5 raw score points.

For Chemistry, the estimated equating relationships took on the same general shapes and were quite similar to one another, with the exception of the BF-Content and

traditional equipercentile methods. The patterns for the BF-Content and traditional equipercentile methods were not similar to any other method, nor were they similar to one another. Differences between the estimated equating relationships and the Identity relationship were generally not greater than 0.75 raw score points, except for the traditional equipercentile method, which showed differences as large as 1.25 raw score points on the Old Form scale. The methods that performed most similarly to the Identity relationship, and that had differences entirely within the DTM boundaries, were the UIRT, BF-Format, and full MIRT two-dimensional method with trait correlations freely estimated. These methods used models that were either unidimensional or had fewer dimensions (and thus fewer parameters), and so it seems plausible that they performed closest to the criterion since Chemistry appeared unidimensional.

For Spanish, the estimated equating relationships were similar to one another, with the exception of the full MIRT four-dimensional orthogonal and traditional equipercentile methods. The full MIRT four-dimensional orthogonal relationship stood apart from all other methods most noticeably, was most similar to the Identity relationship, and was the only method to remain within the DTM boundaries across the entire score scale. However, the pattern of item discriminations on both forms did not agree with the test specifications with regard to content subdomains, which had four levels. The traditional equipercentile relationship was the most curvilinear and among the most different from the Identity relationship, which was unexpected. The differences seen with the traditional equipercentile method are less likely to be due to group differences since the shapes of the ability distributions were very similar for the Old and New Form groups. Also, the UIRT relationship was considered similar to the Bifactor and full MIRT (excluding the 4D orthogonal method) relationships for the majority of the score scale, but not at lower score ranges. Specifically, the UIRT relationship was lower than the Identity relationship by approximately two raw score points on the Old Form scale for lower score ranges. This last finding may indicate that examinees with lower abilities are

For English, the equating relationships for the Bifactor and full MIRT methods were generally very similar to one another, while the relationships for the UIRT and traditional equipercentile methods showed somewhat unique patterns. The UIRT relationship followed the same pattern as the MIRT methods but was slightly more curvilinear and hence differed more from the Identity relationship in comparison to the MIRT methods. The relationship for the traditional equipercentile method was the most curvilinear of all the methods and was most different overall from the Identity relationship. While none of the studied methods produced equating relationships that were consistently within the DTM boundaries, all differences were within approximately ±1.0 raw score points.

Matched Samples Datasets

For the Matched Samples datasets, the shapes and placements of the estimated equating relationships were used for comparing the studied equating methods. Similar conclusions were drawn from the estimated equating relationships for Chemistry and Spanish, but not English. For Chemistry, the estimated equating relationships for all Bifactor and full MIRT methods had the same general shape; however, there were two clear groupings. In the first group were the estimated equating relationships for both four-dimensional full MIRT models and the two-dimensional orthogonal full MIRT model. In the second group were both Bifactor methods and the two-dimensional correlated full MIRT method. The equating relationships included in the second group were always above those in the first group. At this time, it is difficult to determine what caused there to be two general groupings; however, it is apparent that the Bifactor and correlated two-dimensional full MIRT models represented the lower dimensional methods. Lastly, for Chemistry, the shapes of the UIRT and traditional equipercentile relationships were different from those of the Bifactor and full MIRT methods and were also different from one another. The fact that these last two methods produced quite different equating relationships for a rather unidimensional exam may indicate that group differences were too large (as was seen in the total score distributions) for the equipercentile method to perform adequately.

The equating relationship patterns found with Spanish were very similar to those found with Chemistry. For example, the same two groupings of equating relationships found with Chemistry were also found with the Spanish exams, but now the traditional equipercentile and UIRT relationships had very similar patterns to the Bifactor and two-dimensional correlated full MIRT relationships. Therefore, for Spanish, there were two clear groupings and the UIRT and traditional equipercentile methods fit with the second group, which was not found with Chemistry. This may be related to the fact that the shapes and moments of the total score distributions were very similar for the Spanish Old and New Forms, and were less similar between the Chemistry Old and New Forms. Also similar to the Chemistry exams, the relationships for the second group were all above the relationships in the first group. Therefore, the relationships for both four-dimensional full MIRT methods and the two-dimensional orthogonal full MIRT methods suggested smaller differences between forms at low abilities and larger differences between forms at high abilities, in comparison to the methods in the other group. It is difficult to hypothesize about the reason(s) for this pattern; however, the fact that it was seen across two different subject areas and with tests that had different dimensional structures warrants further attention.

Unlike with the Chemistry and Spanish exams, the equating relationships for English were all very similar, and no clear groupings of equating results emerged. All equating relationships had the same general pattern and, if used operationally, would all result in Old Form equivalents that were within one raw score point of one another.

Summary

In general, it seems that the shapes of the equating relationships for the traditional equipercentile, UIRT, Bifactor, and full MIRT methods had very similar features, with a few exceptions.

The first exception applies to the Single Population datasets, where Identity equating served as the criterion. For Chemistry, Spanish, and English, the traditional equipercentile method resulted in an estimated equating relationship that was generally most different from the Identity relationship. The relationships found by using the traditional equipercentile method were also generally more curvilinear than those of the other methods. There have been mixed findings in the literature on whether traditional or IRT equating methods are more robust to group differences (Schmitt et al., 1990; Eignor, Stocking, & Cook, 1990). Therefore, it is difficult to hypothesize the extent to which the presence of group differences affected each of the studied equating methods in this dissertation. Even though examinees were drawn from the same population (i.e., the original Old Form), the act of sampling 3,000 examinees to create the pseudo-Old and New Forms may have resulted in group differences that were large enough to impact the traditional equipercentile method but not the (M)IRT methods. On the other hand, the traditional equipercentile equating relationship may represent the small group differences more accurately than the (M)IRT methods. At this point, it is difficult to make that judgment since the population equating relationship is unknown.

The second exception to the claim that all methods generally resulted in similar equating relationships was found with the Matched Samples Chemistry and Spanish exams. Specifically, the two-dimensional full MIRT orthogonal method behaved more similarly to both four-dimensional full MIRT methods than it did to the two-dimensional full MIRT correlated method. This might have been related to the fact that the estimated correlations between latent factors for the four-dimensional solutions were generally low (from ±0.01 to 0.13 and from ±0.02 to 0.38 for Chemistry and Spanish, respectively), whereas the correlations between latent factors for the two-dimensional solutions were higher (from ±0.09 to 0.25 and from ±0.39 to 0.56 for Chemistry and Spanish, respectively). Therefore, differences between the full MIRT four-dimensional orthogonal and correlated models were smaller than the differences between the two-dimensional orthogonal and correlated models because the estimated correlations in the four-dimensional correlated models were not that large to start with. Since the latent correlations for the full MIRT four-dimensional models were not very large, one would expect the full MIRT two-dimensional orthogonal model (as opposed to the two-dimensional correlated model) to be more similar to both four-dimensional models.

Effect of Latent Trait Structure with Full MIRT Equating
Research Objective #3

In the MIRT literature it is typically more common to find latent traits treated as orthogonal because doing so simplifies procedures such as scale linking. However, it may not always be realistic to treat latent traits as mutually orthogonal, especially with educational data. Therefore, one objective of the current dissertation was to investigate the effect of treating latent abilities as orthogonal versus letting the correlations between them be freely estimated. This effect was investigated with the full MIRT two- and four-dimensional models because the latent abilities involved with the Bifactor model are always assumed to be mutually orthogonal. In general, the differences in equating relationships that were associated with the structure of the latent abilities were not consistent across the Single Population and Matched Samples datasets. Therefore, this section is organized accordingly by dataset and then by AP subject.

Single Population Datasets

For Chemistry, among the two-dimensional full MIRT methods, the correlated models generally resulted in raw and scale score equating relationships that were more similar to the Identity criterion. This makes sense with a unidimensional exam like Chemistry because allowing traits to correlate reduces the uniqueness between them and specifies a model with traits that have greater commonalities. However, the equating relationships for the four-dimensional full MIRT methods were very similar, regardless of whether latent traits were constrained to be orthogonal or correlations between them were freely estimated. For both the two- and four-dimensional methods, using different priors during item calibration led to slightly different equating relationships when latent abilities were allowed to correlate.

However, there were no differences in equating relationships for the orthogonal two- or four-dimensional models that resulted from the use of different item prior settings.

For Spanish Language, for both the two- and four-dimensional full MIRT methods, the orthogonal models led to equating relationships that were generally more similar to the criterion relationship for raw and scale scores. For the four-dimensional solutions, the patterns of the equating relationships for the orthogonal and freely correlated conditions were clearly different from one another. This may be related to the fact that the content subdomains for Spanish include Reading, Listening, Writing, and Speaking, and are quite unique. Again, the choice of priors made no discernible difference in the raw or scale score equating relationships when the latent abilities were mutually orthogonal, but resulted in noticeable differences when latent abilities were free to correlate.

For English, specifying the latent traits as orthogonal or freely correlated for the two-dimensional full MIRT methods had little impact on the raw and scale score equating relationships. Both approaches led to equating relationships that were very similar to the Identity criterion. The choice of priors made a slight difference in the equating relationship when traits were freely correlated, but made no discernible difference when traits were mutually orthogonal.

Matched Samples

For Chemistry, greater differences between the orthogonal and correlated methods were apparent for the full MIRT two-dimensional conditions than for the full MIRT four-dimensional conditions. For the two-dimensional full MIRT methods, the raw and scale score equating relationships when traits were orthogonal were below those for the freely correlated conditions by about 1 raw score point and 0.5 scale score points, respectively. In contrast, the raw and scale score equating relationships for the orthogonal and correlated four-dimensional methods were generally overlapping. For both the two- and four-dimensional conditions, the choice of item prior settings made a small difference when traits were freely correlated, but made no difference when traits were orthogonal.

For Spanish Language, the patterns of the equating relationships were very similar regardless of whether the latent traits were allowed to correlate or not. For both the two- and four-dimensional full MIRT methods, the raw and scale score equating relationships when traits were specified as orthogonal were slightly lower than the relationships when trait correlations were freely estimated. For the two- and four-dimensional conditions, the choice of item prior setting had virtually no impact on the equating relationships when traits were specified as orthogonal. However, when the correlations between the latent traits were freely estimated and a four-dimensional solution was used (but not when the two-dimensional solution was used), the choice of item prior settings made a noticeable difference in the raw and scale score equating relationships.

For English, the pattern and placement of the estimated equating relationships for the two-dimensional orthogonal and correlated conditions were very similar to one another. The choice of item prior settings had no influence on the equating relationships when latent traits were specified as mutually orthogonal, and made only a slight difference (and only at the upper and lower ends of the score scale) when correlations between latent traits were freely estimated.

Summary

In both the Single Population and Matched Samples datasets, the shapes of the equating relationships for raw and scale scores were very similar regardless of whether traits were treated as orthogonal or correlated. The one exception was for the Spanish equating relationships in the Single Population datasets, but this difference was with respect to the four-dimensional models only. While the patterns of the equating relationships were generally similar, the placements of the relationships sometimes varied depending on whether the latent traits were treated as orthogonal or correlated, but no clear trends were evident. For the Chemistry and English Single Population datasets and the English Matched Samples datasets, the full MIRT correlated and orthogonal equating relationships intersected one another too frequently to be considered different.

However, for the Chemistry and Spanish Matched Samples datasets, the orthogonal equating relationships tended to be below those of the correlated relationships for the two- and four-dimensional models. At this time, it is hard to determine what factors may have contributed to this pattern. Another noticeable trend across datasets was that estimated equating relationships seemed to be more affected by the choice of item prior settings with models that allowed the correlations between latent traits to be freely estimated. Therefore, if there are concerns over which prior settings to use, it may be beneficial to use a MIRT model with orthogonal latent traits since priors are less likely to affect the estimated equating relationship. However, for full MIRT models with more than two dimensions, specifying traits as mutually orthogonal may lead to less accurate estimated marginal distributions, such as those found in this dissertation when full MIRT four-dimensional models were used.

Effect of Modeling Dimensionality According to Content or Item Format
Research Objective #4

The Bifactor model is specified in such a way that items are assigned to a general factor and to one group-specific factor, thereby making it a confirmatory model. When the Bifactor model was used in this dissertation, all items were assigned to load on one general dimension and on one group-specific dimension. The group-specific dimensions corresponded to either item format (labeled BF-Format) or content subdomain (labeled BF-Content), but not both simultaneously. Lee and Lee (2014) presented a Bifactor observed score equating method where item format represented the group-specific factor; however, there are no equating studies that defined group-specific factors according to content. Therefore, it was important to determine whether defining the group-specific factor according to item format versus content subdomain would make a significant impact on equating results. The answer to this question depends on how much of the variance in students' responses is attributable to item format or content subdomain, and for this reason, interpretations were made in light of dimensionality assessments (particularly, disattenuated correlations). Since the group-specific dimension in the Bifactor model represents the residual variance not accounted for by the general dimension, large differences between approaches were not expected. The following discussion is broken into two sections, where the first focuses on the Single Population datasets and the second focuses on the Matched Samples datasets.

Single Population Datasets

In general, the estimated equating relationships for raw and scale scores were very similar for the BF-Content and BF-Format methods, with the exception of Chemistry. For Chemistry, the equating relationship using the BF-Format method was rather flat and always within the DTM standard for raw and scale scores. However, the raw and scale score equating relationships produced by the BF-Content method were very bumpy and exceeded the DTM standard for raw scores near the upper end of the ability scale. The use of different item prior settings did not result in very different estimated equating relationships for the BF-Content or BF-Format methods; however, differences were slightly larger for the BF-Content method.

For Spanish and English, the BF-Content and BF-Format estimated equating relationships for raw and scale scores possessed the same pattern and were very similar to one another overall. Slight differences between the BF-Content and BF-Format equating relationships were seen for Spanish, but only at the upper end of the score scale. The use of different prior settings had virtually no effect on either the BF-Content or BF-Format methods, for both Spanish and English.

Matched Samples Datasets

In general, less consistency was seen between the BF-Content and BF-Format estimated equating relationships for the Matched Samples datasets. This trend may be related to the fact that for these datasets, the sampled New and Old Form groups were not as similar as the groups used with the Single Population datasets, particularly with Chemistry.

For Chemistry, the BF-Content and BF-Format equating relationships were quite similar, with the exception that the BF-Content method was slightly bumpy, though not nearly as bumpy as the BF-Content method with the Chemistry Single Population datasets. The use of different prior settings did not influence either of the Bifactor methods for Chemistry.

For Spanish, the Bifactor estimated equating relationships were similar in general, but crossed one another near the middle of the score scale. For example, at the lower end of the ability scale, the BF-Content equating relationship was above the BF-Format relationship, and then this pattern switched near the middle of the ability scale, resulting in the BF-Content relationship being located below the BF-Format relationship at the upper end of the ability scale. Throughout the score scale, differences in the Bifactor equating relationships were never very large, but were noticeably larger when prior settings (I) were used. Furthermore, it appeared as if prior settings affected the BF-Content method more so than the BF-Format method. This finding may be related to the larger number of parameters that needed to be estimated in the BF-Content method in comparison to the BF-Format method. Lastly, the BF-Content relationships appeared slightly bumpy in comparison to the BF-Format relationships.

The biggest difference between the BF-Content and BF-Format estimated equating relationships was found for the English exams. Specifically, the BF-Content relationship was consistently parallel to, and below, the BF-Format relationship across the entire score scale. This finding may suggest that the forms and/or examinee groups differed more with respect to item format effects at the lower end of the ability scale and differed more with respect to content subdomain at the upper end of the score scale. This is because a bigger adjustment is being made with the BF-Format method at the lower end of the scale, while the BF-Content method adjusts for differences to a greater extent at the higher end of the scale. In fact, the disattenuated correlation between the MC and FR item formats was slightly less than the correlation between the two content subdomains. The use of different prior settings did not influence either of the Bifactor methods for English.

Summary

The effect of defining group-specific factors according to item format in comparison to content subdomain varied depending on which datasets (i.e., Single Population or Matched Samples) were being referenced. It is important to note that comparisons were made based on the estimated equating relationships themselves and were not based on differences between the Bifactor methods and the respective criterion method. Therefore, the fact that the equating criteria differed between datasets should be unrelated to the differences seen between datasets. However, differences between datasets may be related to the size of group differences, since fewer differences were found for the Single Population datasets. In general, changing the item prior settings typically did not have a large impact on the estimated equating relationships for either Bifactor method.

A noticeable trend seen across datasets was that the BF-Content equating relationships were sometimes bumpy, whereas this was never found for the BF-Format relationships. This finding may be related to the increase in the number of parameters that needed to be estimated with the BF-Content method. It should also be noted that under the BF-Content model, a reduced number of quadrature points (i.e., 11 as opposed to 25) was used during the estimation of the marginal total score distributions (i.e., during the posterior distribution estimation phase, not during item parameter estimation) because of convergence problems associated with using more quadrature points; this step was needed in order to obtain the estimated marginal ability distributions that were used later to conduct equating. The bumpy pattern sometimes observed with the BF-Content method is likely related to the use of fewer quadrature points. After observing this unexpected bumpy pattern, a mini-experiment was conducted using the Chemistry Single Population datasets. Specifically, the BF-Format and both full MIRT four-dimensional models were used to estimate marginal ability distributions using 11 and 25 quadrature points. For all three models, the estimated marginal distributions were noticeably bumpier when only 11 quadrature points were used and were smooth with the use of 25 points. Therefore, for higher dimensional models, the computer memory required to estimate the marginal ability distributions needed for (M)IRT observed score equating with a sufficient number of quadrature points is likely too burdensome for these models to be considered practically feasible.

Even in the few instances where the BF-Content and BF-Format equating relationships appeared different, the majority of the time the differences between the two methods were trivial. This is likely related to the fact that under the Bifactor model, the majority of the variance is attributed to the general factor, which remained constant across the BF-Content and BF-Format methods. The residual variance, or the variance not accounted for by the general factor, is then accounted for by the group-specific factors. However, if there is not a lot of residual variance left over, then the way it is modeled may not make much of a difference in ability estimation and hence in the equating relationships. Therefore, if observed score Bifactor equating is desirable, then it may be better to specify the group-specific factors according to a construct, like item format, that does not have very many levels. Even though the Bifactor method has the advantage of being a dimension reduction method, in order to obtain the marginal observed score distributions required for equating, the computational demand is equivalent to that of an (m + 1)-dimensional full MIRT model. Therefore, Bifactor models with more than three group-specific factors can be challenging to estimate for the purpose of observed score IRT equating.
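To make the computational demand discussed above concrete, the sketch below illustrates one way a marginal observed score distribution can be obtained under a compensatory multidimensional 2PL model: the Lord and Wingersky (1984) recursion is applied at each node of an m-dimensional Gauss-Hermite grid, and the conditional distributions are weighted and summed. This is only a minimal illustration, not the operational procedure used in this dissertation; the function names are hypothetical, dichotomous items and orthogonal standard normal traits are assumed, and polytomous items would require the extension described by Hanson (1994). With 25 points per dimension the grid contains 25 to the power m nodes (390,625 nodes when m = 4), which is the source of the feasibility concern raised above.

```python
import numpy as np
from itertools import product

def m2pl_prob(theta, a, d):
    """Probability of a correct response under a compensatory M2PL model."""
    return 1.0 / (1.0 + np.exp(-(a @ theta + d)))

def lord_wingersky(p):
    """Observed-score distribution for dichotomous items at a fixed theta."""
    dist = np.array([1.0])
    for p_j in p:
        new = np.zeros(dist.size + 1)
        new[:-1] += dist * (1.0 - p_j)   # item j answered incorrectly
        new[1:] += dist * p_j            # item j answered correctly
        dist = new
    return dist

def marginal_score_dist(a, d, n_quad=25):
    """Marginal observed-score distribution over m orthogonal N(0, 1) traits,
    using an n_quad**m Gauss-Hermite grid (the costly step discussed above)."""
    m = a.shape[1]
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    nodes, weights = nodes * np.sqrt(2.0), weights / np.sqrt(np.pi)  # rescale for N(0, 1)
    total = np.zeros(a.shape[0] + 1)
    for idx in product(range(n_quad), repeat=m):
        theta = nodes[list(idx)]
        w = np.prod(weights[list(idx)])
        total += w * lord_wingersky(m2pl_prob(theta, a, d))
    return total

# Hypothetical example: 20 dichotomous items measuring m = 2 traits.
rng = np.random.default_rng(1)
a = rng.uniform(0.5, 1.5, size=(20, 2))
d = rng.normal(0.0, 1.0, size=20)
print(marginal_score_dist(a, d, n_quad=25).round(4))
```

Under these assumptions, moving from m = 2 to m = 4 multiplies the number of grid evaluations by 625, which mirrors the difficulty reported for Bifactor models with many group-specific factors.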

Limitations and Future Considerations

Like all studies, this dissertation is not without several limitations that need to be mentioned. Several of the limitations of this dissertation can be linked to the fact that all analyses were conducted using real data. While this is not a limitation in and of itself, it did introduce specific challenges, which will be discussed further. These limitations should be considered before any attempts are made to generalize the conclusions found in this dissertation to external situations. In addition to outlining the limitations of this dissertation, specific ways in which it could be improved are also discussed.

The biggest challenge associated with the exclusive use of real data was that the true equating relationships for the AP Exams were not known. This resulted in the need to rely on criteria that were not perfect and contained some error. With this in mind, an attempt was made to quantify the amount of error in one of the equating criteria, the traditional equipercentile method, by including it in conditions where equating was theoretically unnecessary (i.e., equating a test to itself in the Single Population datasets). As it turned out, the traditional equipercentile equating method performed quite differently from the other methods in the Single Population datasets. However, it is difficult to say whether the Identity, traditional equipercentile, or (M)IRT methods were closest to the population equating relationships since small group differences were present. The effect of these differences on the population equating relationship is hard to predict; however, it should be expected that the resulting equating relationship would depart at least somewhat from the Identity relationship. At this point, the traditional equipercentile method is likely the best available criterion for equating studies using Matched Samples from real data; however, further research is needed in this area.

The second criterion, used with the Single Population datasets, was the Identity relationship, where a score on the New Form equals that same score on the Old Form. While this criterion was much more theoretically appealing than the equipercentile relationship, the parameters surrounding it were not ideal. For example, the pseudo-forms sampled and used with the Identity criterion were limited to a form size of 3,000 examinees, which was likely too small to completely eliminate random noise. As a result, small group differences were present and were evident in the studied equating methods. However, the Identity equating relationship does not take group differences into account and thereby misspecifies the population equating relationship to some degree. Even though larger sample sizes were attainable, the decision to use 3,000 examinees per form was made in order to be consistent with the Matched Samples datasets, which were limited by the sample sizes of the New Form. Moving forward, it would be better to use the largest sample sizes permissible by the constraints of the original forms in order to eliminate as much random error as possible. For each of the three subjects used in this dissertation, it would be possible to create pseudo-groups consisting of approximately 50,000 examinees. Therefore, analyses performed with the Single Population datasets should be replicated in a future study using form sample sizes near 50,000 in order to see whether the findings in this dissertation were greatly influenced by random sampling error. It may be the case that sample size, and not the chosen equating criteria, was the biggest limiting factor in this dissertation.

Some challenges associated with not knowing the true equating relationship could have been ameliorated by including a simulation study and specifying the true equating relationship a priori. By knowing the true equating relationship and then estimating the same relationship using several different equating methods, claims concerning how close the studied methods came to recovering the true relationship could be made. However, the greatest challenge associated with a simulation study of this kind is how to define the true equating relationship so that it is not biased towards any particular method. For example, a study by Lee and Brossman (2012) used a simple structure MIRT (SS-MIRT) observed score equating method as the true equating relationship and then estimated the equating relationship using UIRT and SS-MIRT observed score methods and compared differences. In this situation, the estimated equating relationship from the SS-MIRT method (as opposed to the UIRT method) should behave more similarly to the true relationship since the true relationship was established using the same method, which is what was found. Therefore, this study showed that the SS-MIRT method behaved as expected, but not necessarily that it was superior to the UIRT method. Simulation studies need to be conducted in a manner that informs practice but, more importantly, in a manner that produces unbiased results, which evidently can be hard to achieve.

Another obvious limitation of this dissertation was that the AP datasets were not very multidimensional. This posed a serious challenge since a primary objective was to present and evaluate a full MIRT observed score equating method. The AP exams chosen in this dissertation were selected because they either showed evidence of being unidimensional (i.e., Chemistry) or multidimensional (i.e., Spanish and English), and represented a realistic range of dimensional structures. However, even for the subject areas that showed the highest degree of multidimensionality (i.e., Spanish and English), an argument could be made that the UIRT assumption of unidimensionality was still not severely violated for those exams. Therefore, the equating relationships found in this dissertation, and the comparative judgments made among them, may have been quite different if more multidimensional data were used. This dissertation could be improved by including a small-scale simulation study to impose a higher dimensional data structure. Specifically, latent abilities and item discriminations could be sampled jointly from a multivariate distribution, and the correlations between the m latent variates could be altered. The degree of multidimensionality could thereby be varied from weak to strong by specifying between-latent-trait correlations ranging from 0.99 to 0.50, respectively. This would allow various UIRT and MIRT observed score equating procedures to be studied under a range of dimensionality conditions. A simulation study of this type would be best suited to accompany real data analyses and would not be meant as a replacement, since it is not without limitations. Perhaps the biggest criticism of simulation studies is that they are often not realistic enough to inform practice.

However, a simulation study could be useful in the context of developing MIRT equating guidelines if it were conducted alongside real data analyses and the two sets of results were interpreted in conjunction with one another.

In addition to overcoming the challenges imposed by using real data, the overall design of this dissertation could be improved in several ways as well. First, the manner in which the full MIRT equating methods were carried out was exploratory, such that items were free to load on all latent dimensions. For the majority of the exams, the patterns of item loadings, or item discriminations, did not align with the test specifications and were otherwise uninterpretable. This type of exploratory approach is less likely to be carried out in an operational setting given the large amount of time devoted to test development processes. Therefore, this study should be expanded to include the SS-MIRT observed score method, where items could be assigned to load on one of several dimensions as outlined by the test blueprint.

Another limitation related to the design of this dissertation is the overwhelming amount of information presented. Ordinarily this would not be considered a limitation; however, the number of conditions contained in this dissertation sometimes made it difficult to focus on the most meaningful findings because all findings needed to be acknowledged. Given the uncertainties associated with using the traditional equipercentile method as a criterion, it may be beneficial to use only the Identity criterion if this dissertation were repeated or expanded in the future. If one of the objectives of this dissertation had been to determine the best equating criterion, then several criteria would certainly have been needed; however, that was not the focus. Another way to narrow the scope of this dissertation would be to take into account the results of the dimensionality assessments prior to specifying the number of dimensions in the full MIRT models. For this dissertation, it may have made sense to exclude the full MIRT four-dimensional conditions since the results of the dimensionality assessments suggested a two-dimensional solution, or were mixed. Originally, it was decided to include the four-dimensional full MIRT models because there were four levels of content subdomains for Chemistry and Spanish. However, the full MIRT models were conducted in an exploratory manner, and as it turned out, the four estimated dimensions did not correspond to the four content subdomains. Based on the dimensionality assessment and item calibration results, it may have been sufficient to include only the full MIRT two-dimensional conditions, which would have lessened the information overload.

Lastly, all MIRT equating procedures are subject to the limitation associated with the parallelism requirement, which must be met in order for equating to be considered appropriate. This prerequisite to equating becomes more difficult to verify as the number of latent abilities increases, as it does when one moves from a UIRT to a MIRT framework. In a MIRT framework, there are two or more abilities to account for, and all must have some degree of equivalency across the New and Old Form groups, as opposed to one ability with UIRT methods. In addition, the relationships between the abilities should also be similar across form groups. Therefore, the requirement of parallelism becomes more difficult to verify as the number of latent abilities increases. However, this may be considered less of an issue for the Random Groups data collection design, since groups are assumed to be randomly equivalent with respect to all abilities being measured.

Conclusions

In this dissertation, a full MIRT observed score equating procedure for mixed-format tests was introduced and compared with traditional equipercentile, UIRT, and Bifactor equating methods. For the Bifactor methods, group-specific factors were modeled according to item format or content subdomain, and either two or four dimensions were specified for the full MIRT methods. AP Chemistry, Spanish Language, and English Language and Composition exams were used to illustrate the equating procedures, and comparisons were made with respect to two different criteria: the Identity and traditional equipercentile methods.
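Because the traditional equipercentile method serves as one of the two criteria referenced throughout these conclusions, a minimal sketch of unsmoothed equipercentile equating, following the general formulation in Kolen and Brennan (2004), is given below. The function names are hypothetical, and the sketch ignores presmoothing, the operational weighting of MC and FR sections, and the conversion to scale scores; it simply maps each New Form integer score to the Old Form score with the same percentile rank.

```python
import numpy as np

def percentile_ranks(freqs):
    """Percentile ranks at each integer score of a discrete score distribution."""
    rel = np.asarray(freqs, dtype=float)
    rel = rel / rel.sum()
    cum_below = np.concatenate(([0.0], np.cumsum(rel)[:-1]))
    return 100.0 * (cum_below + rel / 2.0)

def equipercentile(new_freqs, old_freqs):
    """Unsmoothed Old Form equivalents of each New Form integer score."""
    p_new = percentile_ranks(new_freqs) / 100.0
    rel_old = np.asarray(old_freqs, dtype=float)
    rel_old = rel_old / rel_old.sum()
    cum_old = np.cumsum(rel_old)
    equivalents = np.empty_like(p_new)
    for x, p in enumerate(p_new):
        y = int(np.searchsorted(cum_old, p, side="left"))  # smallest y with F(y) >= p
        below = cum_old[y - 1] if y > 0 else 0.0
        dens = rel_old[y]
        equivalents[x] = y - 0.5 + ((p - below) / dens if dens > 0 else 0.5)
    return equivalents

# Hypothetical example: two 10-point forms with slightly different difficulty.
new_form = np.array([2, 5, 9, 14, 18, 20, 15, 9, 5, 2, 1])
old_form = np.array([1, 4, 8, 12, 17, 21, 17, 10, 6, 3, 1])
print(equipercentile(new_form, old_form).round(3))
```

In operational work, the unsmoothed equivalents would typically be presmoothed or postsmoothed before use; the sketch is intended only to show the percentile-rank mapping that underlies the criterion.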

Several important conclusions were stated in the current dissertation and will be reviewed here. In general, it was found that the multidimensional methods worked better for datasets that evidenced more multidimensionality, whereas unidimensional methods worked better for unidimensional datasets. Specifically, this trend was found with the Single Population datasets and is in reference to wrmsds and AP Grade agreement percentages. However, even in cases when the dimensional structure of the data did not match the (M)IRT equating method, AP Grade agreement percentages were still generally high, which introduces the next point: the scale matters.

The scale on which scores are reported influenced the comparative conclusions made among the studied methods. For example, with the Spanish Language Single Population datasets, the full MIRT four-dimensional orthogonal equating relationship was by far the most similar to the criterion for raw and scale scores. However, the AP Grade equivalents resulting from the full MIRT four-dimensional orthogonal and UIRT methods were the same for over 99% of the score scale. Therefore, on the AP Grade scale, which is operationally the most important reporting scale for examinees, there typically were not large discrepancies among the UIRT, Bifactor, and full MIRT methods. This was also true for the most unidimensional exam, Chemistry, and suggests that overestimating the dimensionality may not have a significant impact on equating results for reporting scales similar to AP Grades.

Another important finding was that the Bifactor observed score equating method may not be very sensitive to the manner in which the group-specific factors are defined. There were generally very few differences between the equating relationships for the BF-Content and BF-Format methods, with the exception of the Single Population Chemistry and Matched Samples English exams. Therefore, when the Bifactor method is preferred, it may be beneficial to choose the more parsimonious version of the Bifactor model due to the practical constraints associated with estimating the marginal ability distributions.
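As an illustration of the AP Grade comparisons described above, the sketch below shows one way agreement between two equating methods could be computed once each method's Old Form raw-score equivalents have been converted to AP Grades. It is only a minimal sketch: the cut scores, rounding rule, and function names are hypothetical, and an operational comparison might instead weight each score point by the number of examinees obtaining it.

```python
import numpy as np

def ap_grade(raw_scores, grade_cuts):
    """Convert raw composite scores to AP Grades (1-5).
    grade_cuts holds the hypothetical minimum raw score for grades 2, 3, 4, and 5."""
    return 1 + np.searchsorted(np.asarray(grade_cuts), np.asarray(raw_scores), side="right")

def grade_agreement(equivalents_a, equivalents_b, grade_cuts):
    """Percentage of New Form score points assigned the same AP Grade
    by two equating methods."""
    grades_a = ap_grade(np.round(equivalents_a), grade_cuts)
    grades_b = ap_grade(np.round(equivalents_b), grade_cuts)
    return 100.0 * np.mean(grades_a == grades_b)

# Hypothetical example: equivalents from two methods on a 0-100 raw-score scale.
cuts = [35, 50, 65, 80]                 # hypothetical grade boundaries
method_a = np.arange(101) + 0.6         # e.g., UIRT Old Form equivalents
method_b = np.arange(101) + 1.2         # e.g., full MIRT Old Form equivalents
print(grade_agreement(method_a, method_b, cuts))
```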

In general, it seemed as though the AP Exams used in this dissertation either met the UIRT assumption of unidimensionality or were not in severe violation of it. It is worth mentioning that Spanish and English evidenced the greatest degree of multidimensionality among the 14 available AP Exams that could have been included in this dissertation, but were still not very multidimensional. This calls into question the costs and benefits associated with researching and developing new MIRT equating methodology. On the other hand, if the majority of tests are unidimensional but a small fraction are multidimensional, improving the measurement procedures for even that small fraction of tests has merit.

In conclusion, similarities in equating relationships were usually seen within the UIRT and Bifactor methods and within the full MIRT methods, and in some instances very few differences were found across all methods (excluding the traditional equipercentile method). Of importance is that when the unidimensionality assumption is suspect, using a multidimensional equating method is likely to provide a reasonable alternative and is theoretically more defensible. However, the conclusions made in this dissertation were based on a small sample of AP exams and should be verified using additional exams whose characteristics differ from those of the AP Exams, as well as by conducting a realistic simulation study.

Any time an IRT observed score equating method is considered with mixed-format exams, special attention needs to be given to the dimensionality of the datasets. Based on the results from this dissertation, standard exploratory analyses such as principal components analysis and exploratory factor analysis may not provide useful information concerning the latent dimensional structure, but may be useful for getting a rough estimate of the number of latent dimensions. In fact, the information provided by principal components analysis and disattenuated correlations appeared to be more reliable, or stable, than results from Poly-DIMTEST. However, these dimensionality assessments are likely to provide useful information only when used in a confirmatory manner. Based on the results in this dissertation, exploratory factor analysis does not appear to be a viable option for producing latent dimensions that lend themselves to meaningful interpretation. Therefore, at this time, it may be better to specify latent dimensions according to test blueprints and use model fit statistics and disattenuated correlations to determine whether the proposed model is supported. With this said, a simple structure MIRT or Bifactor observed score equating method may be the best option for equating mixed-format exams since both are confirmatory in nature. The benefit of taking a confirmatory MIRT approach is that the latent dimensions are guaranteed to be interpretable, which cannot be said for the full MIRT procedure presented in this dissertation.
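To make the recommendation above concrete, the sketch below shows the standard correction for attenuation that underlies the disattenuated correlations referred to throughout this dissertation: the observed correlation between two subscores (e.g., the MC and FR sections, or two content subdomains) is divided by the square root of the product of their reliabilities, here estimated with coefficient alpha. The function names and the generated data are hypothetical, and the dissertation's own analyses may have used different reliability estimates; values near 1.0 suggest the two item sets measure essentially the same construct.

```python
import numpy as np

def coefficient_alpha(item_scores):
    """Coefficient alpha for an (examinees x items) matrix of item scores."""
    k = item_scores.shape[1]
    item_var = item_scores.var(axis=0, ddof=1).sum()
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_var / total_var)

def disattenuated_correlation(items_x, items_y):
    """Correlation between two subscores corrected for measurement error."""
    x = items_x.sum(axis=1)
    y = items_y.sum(axis=1)
    r_xy = np.corrcoef(x, y)[0, 1]
    return r_xy / np.sqrt(coefficient_alpha(items_x) * coefficient_alpha(items_y))

# Hypothetical example: 1,000 examinees, 40 MC items (0-1) and 4 FR items (0-9).
rng = np.random.default_rng(7)
theta = rng.normal(size=1000)
mc = (rng.normal(size=(1000, 40)) < theta[:, None]).astype(float)
fr = np.clip(np.round(2 * theta[:, None] + rng.normal(size=(1000, 4)) + 4), 0, 9)
print(round(disattenuated_correlation(mc, fr), 3))
```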

149 127 REFERENCES Ackerman, T. (1996). Graphical representation of multidimensional item response theory analyses. Applied Psychological Measurement, 20, Akaike, H. (1974). A new look at the statistical model identification. Automatic Control, IEEE Transactions on, 19, Baker, F. B (1993). Equating tests under the nominal response model. Applied Psychological Measurement, 17, Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Addison-Wesley, Reading, MA. Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT true-score equating. Applied Measurement in Education, 12, Bolt, D. M, & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using markov chain monte carlo. Applied Psychological Measurement, 27, Bock, R. D. (1997). The nominal categories model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp ). New York: Springer. Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, Brennan, R. L., & Kolen, M. J. (1987a). Some practical issues in equating. Applied Psychological Measurement, 11, Brennan, R. L., & Kolen, M. J. (1987b). A reply to Angoff. Applied Psychological Measurement, 11, Brossman, B. G., & Lee, W. (2013). Observed score and true score equating procedures for multidimensional item response theory. Applied Psychological Measurement, 37, Cai, L. (2012). flexmirt (Version 1.88) [Computer program]. Chapel Hill, NC: Vector Psychometric Group, LLC. Cai, L., Yang, J. S., & Hansen, M. (2011). Generalized full-information item bifactor analysis. Psychological Methods, 16,

150 128 Camilli, G., Wang, M., & Fesq, J. (1995). The effects of dimensionality on equating the Law School Admission Test. Journal of Educational Measurement, 32, Carroll, J. B. (1993). Human cognitive abilities: A survey of factor analytic studies. Cambridge University Press, New York. Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, Cohen, A. S., & Kim, S.-H. (1998). An investigation of linking methods under the graded response model. Applied Psychological Measurement, 22, Cook, L. L., Dorans, N.., Eignor, D. R., & Petersen, N. S. (1985). An assessment of the relationship between the assumption of unidimensionality and the quality of IRT true-score equating (ETS Research Report 85-30). Princeton, NJ: Educational Testing Services. Davey, T., Oshima, T. C., & Lee, K. (1996). Linking multidimensioinal item calibrations. Applied Psychological Measurement, 20, de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford Press. de Champlain, A. F. (1996). The effect of multidimensionality of IRT true-score equating for subgroups of examinees. Journal of Educational Measurement, 33, de la Torre, J. (2008). Multidimensional scoring of abilities: The ordered polytomous response case. Applied Psychological Measurement, 32, DeMars, C. E. (2013). A tutorial on interpreting bifactor model scores. International Journal of Testing, 13, Divgi, D. R. (1985). A minimum chi-square method for developing a common metric in item response theory. Applied Psychological Measurement, 9, Dorans, N. J., & Kingston, N. M. (1985). The effects of violations of unidimensionality on the estimation of item and ability parameters and on item response theory equating of the GRE verbal scale. Journal of Educational Measurement, 22, Dorans, N. J., Holland, P. W., Thayer, D. T., & Tateneni, K. (2003). Invariance of score linking across gender groups for three advanced placement program exams. In N. J. Dorans (Ed.), Population invariance of score linking: Theory and applications to advanced placement program examinations (pp ), Research Report Princeton, NJ: Educational Testing Service.

151 129 Eignor, D. R., Stocking, M. L., & Cook, L. L. (1990). Simulation results of effects on linear and curvilinear observed- and true-score equating procedures of matching on a fallible criterion. Applied Measurement in Education, 3, Feldt, L. S. (1975). Estimation of the reliability of a test divided into two parts of unequal length. Psychometrika, 40, Frederiksen, N., Glaser, R., Lesgold, A., & Shafto, M. G. (Eds.) (1990), Diagnostic monitoring of skill and knowledge acquisition. Hillsdale, NJ: Lawrence Erlbaum Associates. Gibbons, R. D., & Hedeker, D. R. (1992). Full-information bi-factor analysis. Psychometrika, 57, Gilmer, J. S., & Feldt, L. S. (1981). Reliability estimation for a test with parts of unequal and unknown lengths. Unpublished manuscript. Gower, J. C., & Dijksterhuis, G. B. (2004) Procrustes problems. Oxford University Press, Oxford, England. Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, Hagge, S. L., & Kolen, M. J. (2011). Equating mixed-format tests with format representative and non-representative common items. In M. J. Kolen & W. Lee (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Volume 1) (CASMA Monograph No. 1). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, The University of Iowa. Hambleton, R. K., & Rovinelli, R. J. (1986). Assessing the dimensionality of a set of test items. Applied Psychological Measurement, 10, Hanson, B. A. (n.d.). Equating Error: A program for computing equating error using the bootstrap. (Windows Console Version, Revised by Z. Cui, May 18, 2004) [Manual]. Unpublished manuscript, College of Education, University of Iowa, Iowa City, Iowa. Hanson, B. A. (1994). An extension of the Lord-Wingersky algorithm to polytomous items. Unpublished research note. Iowa City, IA: ACT, Inc. He, W., Li, F., & Wolfe, E. (2011). Effects of item clusters on the recovery of linking constants using test characteristic curve linking methods: Bifactor model, 2PL IRT model, and graded response model. Paper presented at the AERA Annual Meeting, New Orleans, LA.

152 130 He, W., Li, F., Wolfe, E., & Mao, X. (2012). Model selection for equating testlet-based tests in the NEAT design: An empirical study. Paper presented at the 2012 Annual NCME Conference, Vancouver, BC. Healy, A. F., & McNamara, D. S. (1996) Verbal learning and memory: Does the modal model still work? In Spence, J. T., Darley, J. M., & Foss, D. J. (Eds.), Annual Review of Psychology, 47, Hirsch, T. (1989). Multidimensional equating. Journal of Educational Measurement, 26, Hurley, J. R., & Cattell, R. B. (1962). The Procrustes program: Producing direct rotation to test a hypothesized factor structure. Behavioral Science, 7, Jang, E. E., Roussos, L. (2007). An investigation into the dimensionality of TOEFL using conditional covariance-based nonparametric approach. Journal of Educational Measurement, 44, Kolen, M. J. (1981). Comparison of traditional and item response theory methods for equating tests. Journal of Educational Measurement, 18, Kolen, M. J. (1990). Does matching in equating work? A discussion. Applied Measurement in Education, 3, Kolen, M. J. (2005). RAGE-RGEQUATE. [Computer program]. Iowa City: University of Iowa. Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: methods and practices (Second ed.). New York: Springer. Lee, W., & Brossman, B. G. (2012). Observed score equating for mixed-format tests using a simple-structure multidimensional IRT framework. In M. J. Kolen & W. Lee (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Volume 2) (CASMA Monograph No. 2.2.) Iowa City: Center for Advanced Studies in Measurement and Assessment, The University of Iowa. (Available on Lee, G., & Lee, W. (2014). Bi-factor MIRT observed-score equating for mixed-format tests. Manuscript submitted for publication. Li, Y. H., & Lissitz, R. W. (2000). An evaluation of the accuracy of multidimensional IRT linking. Applied Psychological Measurement, 24, Li, Y. H., & Stout, W. D. (1995). Assessment of unidimensionality for mixed polytomous and dichotomous item data: Refinements of Poly-DIMTEST. Paper presented at

153 131 the Annual Meeting of the National Council on Measurement in Education, San Francisco, CA. Li, Y., Li, S., & Wang, L. (September, 2010). Application of a general polytomous testlet model to the reading section of a large-scale English language assessment (ETS Research Report 10-21) Princeton, NJ: Educational Testing Service. Li, Y. H., Lissitz, R. W., & Yang, Y. N. (April, 1999). Estimating IRT equating coefficients for tests with polytomously and dichotomously scored items. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Montreal, Quebec, Canada. Livingston, S. A., Dorans, N. J., & Wrights, N. K. (1990). What combination of sampling and equating methods works best? Applied Measurement in Education, 3, Lord, F. M. (1980). Applications of item response theory to practical testing problems. NJ: Erlbaum. Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score equatings. Applied Psychological Measurement, 8, Loyd, B. H., & Hoover, H. D. (1980) Vertical equating using the Rasch Model. Journal of Educational Measurement, 17, Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14, Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, McDonald, P. R., & Ahlawat, K. S. (1974). Difficulty factors in binary data. British Journal of Mathematical and Statistical Psychology, 27, McKinley, R. L., & Reckase, M. D. (1982). The use of the general Rasch model with multidimensional item response data (Research Report ONR 82-1). Iowa City, IA: American College Testing. McKinley, R. L., & Reckase, M.D. (1983). An extension of the two-parameter logistic model to the multidimensional latent space (Research Report ONR 83-2). Iowa City, IA: The American College Testing Program. Mulaik, S. A. (1972). A mathematical investigation of some multidimensional Rasch models for psychological tests. Paper presented at the annual meeting of the Psychometric Society, Princeton, NJ.

154 132 Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, Muraki, E., & Behar, I. I. (1995, July). Full information factor analysis of polytomous item responses of NCARB exams. Paper presented at the European Meeting of the Psychometric Society, Leiden, The Netherlands. Muraki, E., & Carlson, J. E. (1993). Full-information factor analysis for polytomous item responses. Paper presented at the annual meeting of the American Educational Research Association, Atlanta, GA. Muraki, E. & Carlson, J. E. (1995). Full-information factor analysis for polytomous item responses. Applied Psychological Measurement, 19, Muthén, L. K., & Muthén, B. O. ( ). Mplus User's Guide. Seventh Edition. Los Angeles, CA: Muthén & Muthén. Oshima, T., Davey, T., & Lee, K. (2000). Multidimensional linking: four practical approaches. Journal of Educational Measurement, 37, Petersen, N. S., Marco, G. L., & Stewart, E. E. (1982). A test of the adequacy of linear score equating models. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp ). New York: Academic Press Inc. Powers, S., Hagge, S. L., Wang, W., He, Y., Liu, C., & Kolen, M. J. (2011). Effects of group differences on mixed-format equating. In M. J. Kolen & W. Lee (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Volume 1) (CASMA Monograph No. 1). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, The University of Iowa. Powers, S., & Kolen, M. J. (2011). Evaluating equating accuracy and assumptions for groups that differ in performance. In M. J. Kolen, & W. Lee (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Volume 1) (CASMA Monograph No. 1). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, The University of Iowa. Powers, S., & Kolen, M. J. (2012). Using matched samples equating methods to improve equating accuracy. In M. J. Kolen, & W. Lee (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Volume 2) (CASMA Monograph No. 2). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, The University of Iowa. (Available on Rasch, G. (1960) Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.

155 133 Reckase, M. D. (1972). Development and application of a multivariate logistic latent trait model. Unpublished doctoral dissertation, Syracuse University, Syracuse, NY. Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9, Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer. Reckase, M. D., & McKinley, R. L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 15, Reise, S. P., Waller, N. G, & Comrey, A. L. (2000). Factor analysis and scale revision. Psychological Assessment, 12, Rijmen, F. (2009). Efficient full information maximum likelihood estimation for multidimensional IRT models (Tech. Rep. No. RR-09-03). Princeton, NJ: Educational Testing Service. Samejima, F. (1969). Estimation of ability using a response pattern of graded scores. Psychometrika Monograph, No. 17. Samejima, F. (1972). A general model for free-response data. Psychometrika Monograph, No. 18. Schilling, S., & Bock, R. D. (2005). High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature. Psychometrika, 70, Schmitt, A. P., Cook, L. L., Dorans, N. J., & Eignor, D. R. (1990). Sensitivity of equating results to different sampling strategies. Applied Measurement in Education, 3, Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, Stout, W., Douglas, B., Junker, B., & Roussos, L. (1999). DIMTEST [Computer software]. The William Stout Institute for Measurement, Champaign, IL.

156 134 Stout, W., Froelich, A. G., Gao, F. (2001). Using resampling to produce and improved DIMTEST procedure. In A. Boomsma, M.A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory (pp ). New York: Springer- Verlag. Sympson, J. B. (1978). A model for testing with multidimensional items. In D. J. Weiss (Ed.), Proceedings of the 1977 Computerized Adaptive Testing Conference, University of Minnesota, Minneapolis. Tate, R. (2000). Performance of a proposed method for the linking of mixed format tests with constructed response and multiple choice items. Journal of Educational Measurement, 37, Tate, R. (2003). A comparison of selected empirical methods for assessing the structure of responses to test items. Applied Psychological Measurement, 27, Thissen, D., Cai, L., & Bock, R. D. (2010). The nominal categories item response model. In M. L. Nering & R. Ostini (Eds.), Handbook of polytomous item response theory models (pp ). New York, NY: Taylor & Francis. Thissen, D., Pommerich, M., Billeaud, K., & Williams, V. S. L. (1995). Item respond theory for scores on tests including polytomous items with ordered responses. Applied Psychological Measurement, 19, Thissen, D., Wainer, H., & Wang, X.-B. (1994). Are tests comprising both multiplechoice and free-response items necessarily less unidimensional than multiplechoice tests? An analysis of two tests. Journal of Educational Measurement, 31, Thompson, T. D., Nering, M., & Davey, T. (1997, June). Multidimensional IRT scale linking without common items or common examinees. Paper presented at the annual meeting of the Psychometric Society, Gatlinburg TN. Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6, Walker, C. M., & Beretvas, S. N. (2003). Comparing multidimensional and unidimensional proficiency classifications: Multidimensional IRT as a diagnostic aid. Journal of Educational Measurement, 40, Wang, M. (1985). Fitting a unidimensional model to multidimensional item response data: The effect of latent space misspecification on the application of IRT (Research Report MW: ).University of Iowa, Iowa City, IA.

157 135 Wang, M. (1986). Fitting a unidimensional model to multidimensional item response data. Paper presented at the Office of Naval Research Contractors Meeting, Gatlinburg, TN. Wang, W.-C., Chen, P. H., & Cheng, Y.-Y. (2004). Improving measurement precision of test batteries using multidimensional item response models. Psychological Methods, 9, Whitely, S. E. (1980). Measuring aptitude processes with multicomponent latent trait models (Technical Report No. NIE-80-5). University of Kansas, Lawrence, KS. Yao, L., & Boughton, K. (2009). Multidimensional linking for tests with mixed item types. Journal of Educational Measurement, 46, Yao, L., & Schwarz, R. (2006). A multidimensional partial credit model with associated item and test statistics: An application to mixed-format tests. Applied Psychological Measurement, 30, Yen, W. M. (1984). Effects of local dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, Yu, F., & Nandakumar, R. (1997). The impact of second ability dimension on the unidimensional estimation of item and ability parameters. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL. Zhang, J. (1996). Some fundamental issues in item response theory with applications (Unpublished doctoral dissertation). Department of Statistics, University of Illinois at Urbana-Champaign. Zhang, J. M., & Stout, W. F. (1999a). Conditional covariance structure of generalized compensatory multidimensional items. Psychometrika, 64, Zhang, J. M., & Stout, W. F. (1999b). The theoretical DETECT index of dimensionality and its application to approximate simple structure. Psychometrika, 64, Zhang, J., & Wang, M. (1998, April). Relating reported scores to latent traits in a multidimensional test. Paper presented at the annual meeting of American Educational Research Association, San Diego, CA. Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3 [computer program]. Chicago, IL: Scientific Software.

APPENDIX: TABLES AND FIGURES

Table A1. Scores and Weights of the MC and FR Items by AP Main Form Exam.
(Columns: Exam; Section; N Items (Categories); Weights, reported separately for the Operational Settings and the Studied Settings. Rows cover the MC and FR sections of the Chemistry, Spanish Language, and English Language and Composition exams.)

Table A2. Scores and Weights of the MC and FR Items by AP Alternate Form Exam.
(Columns: Exam; Section; N Items (Categories); Weights, reported separately for the Operational Settings and the Studied Settings. Rows cover the MC and FR sections of the Chemistry, Spanish Language, and English Language and Composition exams.)
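Tables A1 and A2 list the per-item weights used to form weighted composite raw scores for each form. For reference, a generic weighted composite of the kind these weights imply can be written as follows; the notation here (w_i and v_j for the MC and FR item weights) is supplied for clarity and is not quoted from the dissertation text.

```latex
% Generic weighted composite raw score formed from nMC multiple-choice item
% scores X_i and nFR free-response item scores Y_j with the weights in
% Tables A1-A2 (notation is illustrative):
X = \sum_{i=1}^{n_{MC}} w_i X_i \;+\; \sum_{j=1}^{n_{FR}} v_j Y_j
```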

Table A3. Raw-to-scale score conversion table for AP Chemistry.
(Columns: Raw Composite Score; Unrounded Scale Score; AP Grade.)
Note: Conversion table was created for research purposes only.

Table A4. Raw-to-scale score conversion table for AP Spanish Language.
(Columns: Raw Composite Score; Unrounded Scale Score; AP Grade.)
Note: Conversion table was created for research purposes only.

Table A5. Raw-to-scale score conversion table for AP English Language and Composition.
(Columns: Raw Composite Score; Unrounded Scale Score; AP Grade.)
Note: Conversion table was created for research purposes only.

Table A6. Calibration settings used during item parameter estimation.

General Settings (UIRT / Bifactor / Full MIRT)
  Estimation method: BAEM / BAEM / BAEM
  Max. # of cycles: 3,000 / 3,000 / 3,000
  Convergence criterion: 1e-6 / 1e-6 / 1e-6
  Max. # of M-step iterations: 3,000 / 3,000 / 3,000
  Convergence criterion: 1e-6 / 1e-6 / 1e-6
  # of quadrature points:
  Range of quadrature points: [-6, 6] / [-6, 6] / [-6, 6]

Prior Setting I (all models)
  Prior for a-parameter: lognormal (μ = 0, σ = 0.5)
  Prior for c-parameter: beta (α = 4, β = 16)

Prior Setting II
  Prior for a-parameter: lognormal (μ = -0.4, σ = 0.3); for the Bifactor model, a0 ~ normal (μ = 1.3, σ = 0.8) and as ~ lognormal (μ = -0.4, σ = 0.3)
  Prior for c-parameter: beta (α = 11, β = 91) (all models)

Note. BAEM stands for the Bock-Aitkin Estimation Method. flexMIRT uses (α-1) and (β-1) to describe a beta distribution.
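To make the Prior Setting I distributions in Table A6 concrete, the short sketch below evaluates them with SciPy. This is an illustrative aid only, not flexMIRT syntax; it assumes the parameterization given in the note to Table A7, namely a ~ lognormal(μ = 0, σ = 0.5) for slopes and c ~ beta(5, 17) for lower asymptotes on the conventional (α, β) scale.

```python
# Illustrative sketch only: evaluates the Prior Setting I densities from
# Tables A6-A7 (a ~ lognormal(mu = 0, sigma = 0.5); c ~ beta(5, 17) on the
# conventional (alpha, beta) scale). Not flexMIRT syntax.
import numpy as np
from scipy import stats

a_prior = stats.lognorm(s=0.5, scale=np.exp(0.0))  # lognormal prior for slope parameters
c_prior = stats.beta(5, 17)                         # beta prior for lower asymptotes

grid = np.linspace(0.01, 3.0, 600)
print("slope prior mode  ~", round(grid[np.argmax(a_prior.pdf(grid))], 2))  # about exp(-0.25) = 0.78
print("asymptote prior mean =", round(c_prior.mean(), 3))                   # 5 / 22 = 0.227
```

The printed mode and mean simply confirm that these priors pull slope estimates toward values a little below 1 and lower-asymptote estimates toward roughly 0.2, which is the usual rationale for such priors in 3PL-type calibration.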

Table A7. Studied Conditions for Matched Sample and Single Population Datasets.

UIRT (ability distribution UVN(0, 1); dimensionality N/A):
  Chemistry: Set I = 1, Set II = 2; Spanish: Set I = 3, Set II = 4; English: Set I = 5, Set II = 6.

Bifactor:
  Chemistry, MVN(0, I), ρ = 0: Item Format, Set I = 7, Set II = 8; Content, Set I = 9, Set II = 10.
  Chemistry, MVN(0, 1), ρ = estimated: Item Format, Set I = 11, Set II = 12; Content, Set I = 13, Set II = 14.
  Spanish, MVN(0, I), ρ = 0: Item Format, Set I = 15, Set II = 16; Content, Set I = 17, Set II = 18.
  Spanish, MVN(0, 1), ρ = estimated: Item Format, Set I = 19, Set II = 20; Content, Set I = 21, Set II = 22.
  English, MVN(0, I), ρ = 0: Item Format, Set I = 23, Set II = 24; Content, Set I = 25, Set II = 26.
  English, MVN(0, 1), ρ = estimated: Item Format, Set I = 27, Set II = 28; Content, Set I = 29, Set II = 30.

Full MIRT:
  Chemistry, MVN(0, I), ρ = 0: 2 Dimensions, Set I = 31, Set II = 32; 4 Dimensions, Set I = 33, Set II = 34.
  Chemistry, MVN(0, 1), ρ = estimated: 2 Dimensions, Set I = 35, Set II = 36; 4 Dimensions, Set I = 37, Set II = 38.
  Spanish, MVN(0, I), ρ = 0: 2 Dimensions, Set I = 39, Set II = 40; 4 Dimensions, Set I = 41, Set II = 42.
  Spanish, MVN(0, 1), ρ = estimated: 2 Dimensions, Set I = 43, Set II = 44; 4 Dimensions, Set I = 45, Set II = 46.
  English, MVN(0, I), ρ = 0: 2 Dimensions, Set I = 47, Set II = 48.
  English, MVN(0, 1), ρ = estimated: 2 Dimensions, Set I = 49, Set II = 50.

Equipercentile: Chemistry = 51, Spanish = 52, English = 53.

Note: Set I refers to a ~ lognormal (0, 0.5) and c ~ beta (5, 17); Set II refers to a and a0 ~ normal (1.3, 0.8), as ~ lognormal (-0.4, 0.3), and c ~ beta (11, 91).

Table A8. Effect Sizes for Group Differences in Common Item Scores.
(Rows: Matched Samples; Single Population. Columns: Chemistry; Spanish; English.)
Note: Negative values indicate the New Form group was higher achieving.

Table A9. Descriptive Statistics for Forms Associated with the Matched Samples Design.
(Rows: Scale; Mean; Median; SD; Minimum; Maximum; Kurtosis; Skewness; Reliability. Columns: New and Old Forms for Chemistry, Spanish, and English.)
Note: Scale represents the total range of composite scores. Reliability is defined using stratified alpha where strata represent item format.
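The effect sizes in Table A8 summarize how far apart the pseudo Old and New Form groups are on the common items. One conventional standardized mean difference of this kind is shown below for reference; the pooled-standard-deviation form and the notation are supplied here as an illustration and are not quoted from the dissertation.

```latex
% Illustrative standardized mean difference on common-item scores V,
% with a pooled standard deviation in the denominator (generic notation):
ES = \frac{\bar{V}_{old} - \bar{V}_{new}}
          {\sqrt{\dfrac{(n_{old}-1)s^{2}_{old} + (n_{new}-1)s^{2}_{new}}{n_{old}+n_{new}-2}}}
```

Under this sign convention, a negative value means the New Form group scored higher on the common items, consistent with the note to Table A8.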

Table A10. Descriptive Statistics for Forms Associated with the Single Population Design.
(Rows: Scale; Mean; Median; SD; Minimum; Maximum; Kurtosis; Skewness; Reliability. Columns: New and Old Forms for Chemistry, Spanish, and English.)
Note: Scale represents the total range of composite scores. Reliability is defined using stratified alpha where strata represent item format.

Table A11. Test Blueprint for Chemistry.
(Rows: Multiple Choice; Free Response. Columns: number of items in Content Subdomains 1 through 4.)

Table A12. Test Blueprint for Spanish Language.
(Rows: Multiple Choice; Free Response. Columns: number of items in Content Subdomains 1 through 4.)

Table A13. Test Blueprint for English Language and Composition.
(Rows: Multiple Choice; Free Response. Columns: number of items in Content Subdomains 1 and 2.)
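Tables A9 and A10 report reliability as stratified alpha with item format defining the strata. For readers unfamiliar with the coefficient, the standard stratified alpha formula is restated here for reference.

```latex
% Stratified coefficient alpha with strata i = MC, FR:
% sigma^2_{X_i} is the variance of stratum i, alpha_i its coefficient alpha,
% and sigma^2_X the variance of the total composite.
\alpha_{strat} = 1 - \frac{\sum_{i} \sigma^{2}_{X_i}\,(1-\alpha_i)}{\sigma^{2}_{X}}
```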

Table A14. Poly-DIMTEST Test Statistic Probabilities for Matched Sample Forms.
(Rows: CFA Format; CFA Content with F1, F2, F3, and F4 as the assessment subtest (AT); EFA. Columns: New and Old Forms for Chemistry, Spanish, and English.)
Note: -- indicates the number of items in the content subdomain exceeded the limits imposed in Poly-DIMTEST. * p <

Table A15. Disattenuated Correlations for Matched Sample Forms.
(Rows: Format; Content subdomain pairs C1 with C2, C1 with C3, C1 with C4, C2 with C3, C2 with C4, and C3 with C4. Columns: New and Old Forms for Chemistry, Spanish, and English.)

Table A16. AIC Indices for EFA and CFA Fitted Models.
(Rows: 1-, 2-, 3-, and 4-Factor EFA; Format CFA; Content CFA. Columns: New and Old Forms for Chemistry, Spanish, and English.)
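Two quantities used in Tables A15 through A17 are standard and are restated here for convenience rather than quoted from the dissertation: the disattenuated correlation between two subscores, and the AIC and BIC model-selection indices.

```latex
% Disattenuated (correction-for-attenuation) correlation between subscores X and Y,
% where r_XX and r_YY are the subscore reliabilities:
\rho_{T_X T_Y} = \frac{r_{XY}}{\sqrt{r_{XX}\, r_{YY}}}

% Information criteria for a fitted model with log-likelihood log L,
% k estimated parameters, and sample size n:
AIC = -2\log L + 2k, \qquad BIC = -2\log L + k\,\ln n
```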

Table A17. BIC Indices for EFA and CFA Fitted Models.
(Rows: 1-, 2-, 3-, and 4-Factor EFA; Format CFA; Content CFA. Columns: New and Old Forms for Chemistry, Spanish, and English.)

Table A18. Presmoothing Parameters used for the Old and New Form Distributions.

  Dataset             Chemistry   Spanish   English
  Matched Samples     6, 6        4, 4      5, 5
  Single Population   6, 6        5, 5      5, 5

Note: The first and second values are smoothing parameters for the New and Old Forms, respectively.

Table A19. Weighted Root Mean Squared Differences for Raw Scores Using the Single Population Datasets and Identity Criterion.
(Rows: UIRT (I) and (II); Bifactor Format (I) and (II); Bifactor Content (I) and (II); Orthogonal and Correlated 2D (I) and (II); Orthogonal and Correlated 4D (I) and (II); Equipercentile. Columns: Chemistry; Spanish; English.)
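Table A18 lists the polynomial degrees used to presmooth the New and Old Form score distributions before equipercentile equating. The sketch below shows one minimal way to carry out such log-linear presmoothing (a Poisson log-linear fit whose degree-C polynomial preserves the first C moments of the observed distribution); it is an illustration under those assumptions, not the software actually used in the study.

```python
# Minimal log-linear presmoothing sketch (Holland-Thayer style): fit a
# degree-C polynomial log-linear model to a score frequency distribution.
# Illustrative only; not the program used in the dissertation.
import numpy as np
import statsmodels.api as sm

def loglinear_presmooth(freqs, degree):
    """Return smoothed relative frequencies for score points 0..K.

    freqs  : observed frequency at each score point
    degree : polynomial degree C (e.g., 6 for Chemistry in Table A18)
    """
    scores = np.arange(len(freqs), dtype=float)
    u = scores / scores.max()                      # rescale scores for numerical stability
    X = np.column_stack([u ** c for c in range(1, degree + 1)])
    X = sm.add_constant(X)
    fit = sm.GLM(freqs, X, family=sm.families.Poisson()).fit()
    smoothed = fit.fittedvalues                    # smoothed frequencies
    return smoothed / smoothed.sum()

# Toy usage: presmooth a simulated 0-60 score distribution with degree 6.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy = np.bincount(rng.binomial(60, 0.55, size=2000), minlength=61)
    print(loglinear_presmooth(toy, degree=6).round(4))
```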

Table A20. Weighted Root Mean Squared Differences for Scale Scores Using the Single Population Datasets and Identity Criterion.
(Rows: UIRT (I) and (II); Bifactor Format (I) and (II); Bifactor Content (I) and (II); Orthogonal and Correlated 2D (I) and (II); Orthogonal and Correlated 4D (I) and (II); Equipercentile. Columns: Chemistry; Spanish; English.)

Table A21. Weighted Root Mean Squared Differences for Raw Scores Using the Matched Samples Datasets and Traditional Equipercentile Criterion.
(Rows: UIRT (I) and (II); Bifactor Format (I) and (II); Bifactor Content (I) and (II); Orthogonal and Correlated 2D (I) and (II); Orthogonal and Correlated 4D (I) and (II). Columns: Chemistry; Spanish; English.)
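Tables A19 through A22 summarize equating accuracy with a weighted root mean squared difference between each studied equating function and the criterion function. Written out generically for reference (the notation below is supplied for clarity rather than quoted from the dissertation), with eq(x) the studied equivalent, crit(x) the criterion equivalent, and w(x) the relative frequency at raw score x, the index is

```latex
% Weighted root mean squared difference between a studied equating function
% and the criterion function, weighted by the score distribution:
WRMSD = \sqrt{\sum_{x} w(x)\,\bigl[\,eq(x) - crit(x)\,\bigr]^{2}}
```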

Table A22. Weighted Root Mean Squared Differences for Scale Scores Using the Matched Samples Datasets and Traditional Equipercentile Criterion.
(Rows: UIRT (I) and (II); Bifactor Format (I) and (II); Bifactor Content (I) and (II); Orthogonal and Correlated 2D (I) and (II); Orthogonal and Correlated 4D (I) and (II). Columns: Chemistry; Spanish; English.)

Table A23. Overall Exact Percent Agreement of AP Grades for Single Population Datasets.
(Rows: UIRT (I) and (II); Bifactor Format (I) and (II); Bifactor Content (I) and (II); Orthogonal and Correlated 2D (I) and (II); Orthogonal and Correlated 4D (I) and (II); Equipercentile. Columns: Chemistry; Spanish; English.)
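The agreement statistics in Tables A23 and A24 compare the AP grade (1-5) an examinee receives under a studied equating with the grade implied by the criterion equating. For reference, and again stated generically rather than quoted from the dissertation, overall exact percent agreement over N examinees can be written as

```latex
% Overall exact percent agreement of AP grade classifications, where
% G_{eq,p} and G_{crit,p} are the grades assigned to examinee p under the
% studied equating and the criterion equating, respectively:
Agreement = \frac{100}{N}\sum_{p=1}^{N} \mathbb{1}\bigl[\,G_{eq,p} = G_{crit,p}\,\bigr]
```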

Table A24. Overall Exact Percent Agreement of AP Grades for Matched Samples Datasets.
(Rows: UIRT (I) and (II); Bifactor Format (I) and (II); Bifactor Content (I) and (II); Orthogonal and Correlated 2D (I) and (II); Orthogonal and Correlated 4D (I) and (II). Columns: Chemistry; Spanish; English.)

Figure A1. Weighted Total Score Distributions for Pseudo Old and New Form Examinees for Chemistry using the Single Population Dataset.

Figure A2. Weighted Total Score Distributions for Pseudo Old and New Form Examinees for Spanish using the Single Population Dataset.

Figure A3. Weighted Total Score Distributions for Pseudo Old and New Form Examinees for English using the Single Population Dataset.

Figure A4. Weighted Total Score Distributions for Pseudo Old and New Form Examinees for Chemistry using the Matched Samples Dataset.

Figure A5. Weighted Total Score Distributions for Pseudo Old and New Form Examinees for Spanish using the Matched Samples Dataset.
