The Impact of Item Position Change on Item Parameters and Common Equating Results under the 3PL Model


The Impact of Item Position Change on Item Parameters and Common Equating Results under the 3PL Model

Annual Meeting of the National Council on Measurement in Education, Vancouver, B.C.

Jason L. Meyers, Stephen Murphy, Joshua Goodman, Ahmet Turhan

April 2012

Introduction

Operational testing programs employing item response theory (IRT) applications benefit from the property of item parameter invariance, whereby item parameter estimates obtained from one sample can be applied to other samples (when the underlying assumptions are satisfied). In theory, this feature allows for applications such as computer-adaptive testing (CAT) and test pre-equating. In practice, item parameter invariance can be threatened by a number of factors including context effects, item position effects, instructional effects, variable sample sizes, and other sources of item parameter drift that are not formally modeled in IRT applications. Many researchers have documented the existence and investigated the influence of such item-level effects (Whitely & Dawis, 1976; Yen, 1980; Klein & Bolus, 1983; Kingston & Dorans, 1984; Rubin & Mott, 1984; Leary & Dorans, 1985). One situation that can threaten item parameter invariance is shifting item locations across test administrations. Often a concerted effort is made to keep the items used for equating in identical or very similar positions within the test from one use to the next. However, this can be difficult to achieve due to test security concerns, test disclosure requirements, limitations in the item bank, or test content and design considerations (e.g., items being associated with reading passages). In fact, as curriculum standards are often updated to reflect new learning content that legislators, administrators, and teachers deem important for student success, the test blueprint also needs adjustment to reflect the new content standard emphasis. These modifications to the test blueprint can potentially tax item banks as item development and field testing attempt to catch up to the new curriculum design. These new standards will need to be reflected in the anchor item set for equating purposes (Kolen & Brennan, 2004).
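For reference, under the 3PL model the probability that an examinee with ability \theta answers item j correctly is typically written as

$$P_j(\theta) = c_j + \frac{1 - c_j}{1 + \exp\left[-D a_j (\theta - b_j)\right]}$$

where a_j, b_j, and c_j are the item discrimination, difficulty, and pseudo-guessing parameters and D is a scaling constant (commonly 1.7). Item parameter invariance implies that, apart from sampling error and the arbitrary location and scale of the ability metric, these estimates should not depend on which examinee sample is used for calibration.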

In their 2009 paper, Meyers et al. modeled the impact of item position change (IPC) on changes in Rasch item difficulty using the scaling and equating procedures employed by one specific K-12 testing program. This study helped assess the practical impact of IPC between test administrations. However, as they acknowledged, the generalizability of their findings is unclear because the study evaluated only one large K-12 testing program that utilize[d] a particular measurement model, equating procedures, test administration procedures (i.e., an untimed test), and test construction procedures.

Objectives/Purpose

This study extends the Meyers et al. study by investigating the impact of IPC in the context of operational testing programs that employ the 3PL model, alternative equating procedures, and different item re-use policies. The purpose here is to expand upon the previous research by investigating the impact of IPC, sample size, subject area, grade, elapsed time between item uses, and number of previous uses on change in the IRT a-, b-, and c-parameters as well as the resulting D² statistic (Murphy et al., 2010), defined as the weighted sum of the squared deviations between the item characteristic curves. The variables included in this investigation were selected because they are often ones influenced by policies that may constrain test construction and administration in large-scale assessment programs.

Analysis Using Real Data

The data used for modeling change in item parameters between test administrations and in post-equating results came from the 2010 and 2011 operational administrations of assessment programs within two states. One state is located in the Southwestern United States and the other is situated within the Southeastern United States.

In both cases, students take the criterion-referenced standardized tests in reading, mathematics, writing, social studies, and science in the lower grade levels, and end-of-course tests at the high school level at the end of the year (i.e., summative assessments). These tests are administered in both paper-and-pencil and online formats and are untimed. Depending on grade and subject, the tests contain between 45 and 70 multiple-choice items. The writing assessments also contain one open-ended constructed response item. For these large-scale testing programs, as is typically done, newly developed items are field tested by embedding them in operational test forms so that they are indistinguishable from operational items. In the case of these tests, the 3PL model and the Stocking and Lord (1983) test characteristic curve method are used to equate operational and field test item parameters to the base scale. Descriptive statistics for the items appearing in each grade and subject are presented in Table 1 and reasonably resemble most K-12 large-scale summative assessment programs. Please note that data from multiple administrations were used for some grades and subjects. In addition, all end-of-course tests are indicated by Grade 12 in the table. However, that does not necessarily mean these tests are administered at grade 12, but simply at any high school grade. Tables 2 and 3 present the field test (or prior test use) and equated operational test difficulty (b-parameter) and discrimination (a-parameter) values, respectively, for the items included in these analyses. On average, the field test items and operational test items were of roughly equal difficulty. Despite concerted efforts to the contrary, the positions of the items within the test form often change between field-testing and operational use or, for non-(horizontal) linking (anchor) operational items, between their most recent use and their current use.

Table 4 indicates, across all tests analyzed, how far items shifted between field test or prior operational use and the current operational test administrations. Note that for a given grade and subject, the field test item set and operational test contain the same number of items. Item position change is calculated as the position on the field test (or prior use test) minus the position on the operational test, so a negative number indicates that an item has moved closer toward the beginning of the test and a positive number indicates that an item has moved closer toward the end of the test. As an example, an item appearing in position 10 on both the field test and the operational test would have a position change of 0. As one would expect, the overwhelming majority of items shifted less than 5 positions between administrations. However, roughly eight percent of the items shifted more than 20 positions. Table 5 indicates how much change in the discrimination values (a-parameter) was observed between field test and operational test as a function of how many positions the item shifted between administrations. In general, a-parameters changed more the further the item shifted in position between administrations. Table 6 presents this same information for the change in item difficulty (b-parameter). In general, b-parameters changed more the further the item shifted in position between administrations. However, the largest average difficulty change was for items that moved less than 5 positions. This is most likely an artifact of sample size, as the overwhelming majority of items included in these analyses moved between zero and five positions toward the beginning of the test. This average is also heavily affected by two items (outliers) that had enormous b-value changes between administrations; removing these items resulted in a considerably smaller average b-value change. However, the parameters for these two outliers were retained in these analyses since these items are included in the actual item banks of these two assessments. For these two reasons, this finding should be interpreted with caution.
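Because the D² statistic figures prominently in the results that follow, a minimal sketch of how such a statistic can be computed for a single item under the 3PL model is given below. The quadrature grid and standard normal weights are illustrative assumptions; the operational weighting is the one described by Murphy et al. (2010) and is not reproduced here.

```python
import numpy as np

D = 1.7  # logistic scaling constant commonly used with the 3PL model


def p3pl(theta, a, b, c):
    """3PL item characteristic curve evaluated at the ability values theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))


def d_squared(old, new, theta=None, weights=None):
    """Weighted sum of squared differences between an item's prior-use and
    current-use item characteristic curves.  `old` and `new` are (a, b, c)
    tuples from the two calibrations."""
    if theta is None:
        theta = np.linspace(-4.0, 4.0, 41)
    if weights is None:
        weights = np.exp(-0.5 * theta**2)
        weights /= weights.sum()          # assumed standard normal weighting
    diff = p3pl(theta, *old) - p3pl(theta, *new)
    return float(np.sum(weights * diff**2))


# Example: an item whose difficulty drifted by 0.25 between uses
print(d_squared(old=(1.0, 0.10, 0.20), new=(1.0, 0.35, 0.20)))
```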

Table 7 presents descriptive statistics for the D² statistic as a function of change in item position. Recall that the D² statistic represents the weighted sum of the squared differences between the (previous and current use) item characteristic curves. D² was considered an important outcome for this study, not only because it takes into account both the a- and b-parameters and their interaction, but because for the assessment programs in these two states D² statistics are used to determine which items to retain in the final equating set. Items with large D² values are often removed from the equating item set in the calculation of the Stocking and Lord constants. The results displayed in Table 7 suggest a very clear pattern: D² statistics increase in magnitude the further items move from their field test locations. As a group, the results from these analyses suggest that both item parameters and D² statistics are negatively impacted by change in item position. In addition to looking at change in item position, the amount of time elapsing between the field test or most recent operational use and the current operational test was investigated to determine its potential impact on the observed item parameter and D² values. Although the typical process for these large-scale assessments is to include on operational test forms items that were field tested one year prior, this is often not possible for a variety of reasons, such as blueprint changes, item rejections at data review, or a variety of content-based requirements such as clueing, variety, and clanging. Table 8 provides a breakdown for each test of how many years previous to their current operational use the items were last used operationally. As one might expect, the majority (approximately 66%) of items included in these analyses were previously tested either within the same year (e.g., spring 2010 field test and winter 2010 operational test) or one year prior to their current operational use. However, there were a handful of items that were last used as many as seven years prior to their current use.

Given previous research on item drift, one might expect that item parameters would naturally shift over time due to factors such as changes in curriculum or instructional methods.

Finally, we investigated item exposure rates, the number of times an item has been previously used, to assess any potential impact on the observed changes in the item parameters or in D² values. One might expect item-level statistics to shift slightly the more often that items are exposed to the testing population. In large-scale, established statewide testing programs such as those included in these analyses, non-linking items are often re-used due to limitations in the number of items able to be field-tested or limited item banks, which increases security concerns around the potential for increased item exposure. Efforts to restrict repeated item use are often further hampered by such circumstances as blueprint changes, content-driven constraints, and greater than expected loss of items at data review. Table 9 presents a breakdown of the number of previous operational uses for items used in these analyses. These values do not include the item's field test administration, so a value of 0, for example, means that the item had never appeared on an operational test form before the current one. The majority (approximately 90%) of items had either never been used previously or had appeared on only one operational form previous to the current use. However, there are some instances of multiple uses, with the highest being six previous occurrences.

Regressions

To model the change in item parameters and D² values as a result of the factors considered in this study, several multiple regressions were performed. First, across all grades and subjects, changes in the IRT parameters and D² statistics were modeled as a function of IPC, test subject, grade, number of prior uses, and elapsed time between administrations.

Overall, the results from these regressions were unremarkable and suggested that the variables of interest had little to no impact on the outcomes of interest. For change in the a-parameter, the only variable that proved to have a statistically significant impact was item position change. That relationship was modeled as:

Δa = β0 + β1(IPC)    (1)

However, though a statistically significant predictor, IPC accounted for only 11% of the variation in change in the discrimination value. In modeling the change in item difficulty, surprisingly, neither the overall regression nor any of the explanatory variables were found to be statistically significant. The D² statistic was the most heavily impacted by the variables of interest. Although the model accounted for just over 2% of the variation in D² values, and D² was seemingly not significantly impacted by item position change, the resulting model was:

D² = β0 + β1(time) + β2(uses)    (2)

Not only were the results of these aggregate regressions unremarkable from a research perspective, they also did not correspond with our operational experiences and observations. As such, as a next step, regressions were performed individually for each grade and subject to determine if there were patterns that might be masked when performing the regressions on the total set of tests.
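As an illustration of the kind of model being fit, the sketch below sets up the same regression structure on a small synthetic data set; the column names and generated values are purely hypothetical and bear no relation to the operational data or results reported here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic illustration only: columns are hypothetical stand-ins for the
# item-level variables described in the text.
rng = np.random.default_rng(0)
n = 200
items = pd.DataFrame({
    "ipc":     rng.integers(-20, 21, n),             # item position change
    "time":    rng.integers(0, 8, n),                # years since prior use
    "uses":    rng.integers(0, 4, n),                # prior operational uses
    "subject": rng.choice(["Reading", "Math"], n),
    "grade":   rng.choice([3, 4, 5, 6, 7, 8], n),
})
items["delta_a"] = 0.002 * items["ipc"] + rng.normal(0, 0.10, n)   # change in a
items["delta_b"] = rng.normal(0, 0.20, n)                          # change in b
items["d2"] = rng.normal(0, 0.002, n) ** 2 + 0.0002 * items["time"]

# One regression per outcome, mirroring the aggregate models in the text:
# outcome ~ IPC + elapsed time + prior uses + subject + grade.
for outcome in ("delta_a", "delta_b", "d2"):
    fit = smf.ols(f"{outcome} ~ ipc + time + uses + C(subject) + C(grade)",
                  data=items).fit()
    print(outcome, "R-squared:", round(fit.rsquared, 3))
```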

Tables 10 and 11 provide the regression results for the impact of the variables of interest on the change in the a- and b-parameters, respectively. Table 12 presents this same information for D². Values highlighted in red indicate effects that were significant at the 0.05 alpha level. As one can see, these relationships varied considerably across test and subject. Although in most cases the R² values suggested that our variables were not able to account for much of the variation in change in either the a- or b-parameters, it appears as if item position change more heavily impacted the outcomes of interest in the reading tests. This result makes sense, given that these are passage-based tests and items are more likely to experience more dramatic position shifts between administrations. Greater position shifts are likely to occur in passage-based tests due to the restriction that a set of items must always be placed with its passage, thereby restricting the flexibility to move items during test construction in order to limit item position shifts.

Simulation Study

To further investigate the effects of IPC in the context of the equating procedures used in these testing programs, a simulation study was conducted. To mimic a situation where the factors of interest in this study are most likely to affect the equating process, the types of tests that displayed shifts in item parameters associated with the factors of interest in the real data analysis (i.e., passage-based reading tests) served as the basis for the simulation. Specifically, we modeled the shift in a- and b-parameters associated with three item-level factors (item position change, time since field-test administration, and number of administrations in which the item appeared previously) using the results of the multiple regressions from the grades 3 through 8 reading assessments in one of the two states modeled in the real data analysis. Two test-level factors (sample size and overall field test design), though not explicitly included in the real data analysis, were also included in the simulation.

Simulation Conditions

With the exception of sample size (which contained three levels), each simulation condition included in the study consisted of two levels: a level that reflects best practices and a level that reflects extreme practices.

All conditions were fully crossed for a total of 48 unique combinations of conditions. Brief descriptions of each condition included in the study (and the levels within each condition) are given below.

Item Position Change (IPC): As described in the real data analysis section, IPC is defined as the item's position in the test during field-testing or its most recent operational use compared to its placement in the current operational test. This condition consisted of two levels. The first level (minor shifts) reflected only minor changes in IPC since prior use/field testing and represents best practices in test construction. In this condition, 10 out of 50 operational items were placed in the same location as their prior use/field test position, and the remaining items were distributed evenly so that they were placed within one to five positions of their field-test locations. The second level (major shifts) represents a situation where items tend to be placed in locations distant from their initial placement. In this extreme condition, 20 operational items were placed within nine positions of their prior use/field test position, and the remaining 30 items were placed within 10 to 15 positions of their prior use/field test location.

Elapsed Time since Field-testing (Time): Elapsed time is the number of years between the prior operational use or field-testing of the item and the current operational use of the item. Again, two levels of the condition were simulated. The first level (short gap) simulated a short gap between administrations, where 40 out of 50 items were used within 2 years of prior use/field-testing and the remaining 10 items had a gap of either 3 or 4 years between administrations. Conversely, the second level (long gap) represents the situation in which there is typically a long gap between prior use/field-testing and operational use. Of the 50 items, 40 had a gap of 3 or 4 years and the remaining ten had a gap of 1 or 2 years between administrations.

Previous operational uses (Use): Two levels of the condition were simulated. The first level (light usage) simulates a test with items that have had light usage before they are retired, where 40 out of 50 items are used no more than twice and the remaining 10 items are used 3 or 4 times. The second level (heavy usage) of this condition represents a test whose items have been used heavily. Here, 40 of the items were used three or four times and 10 items were used either once or twice.

Field-test design (Design): This is a condition applied at the test level and was intended to reflect the impact that different field test designs might have on the outcomes of interest. In the first condition, field test items would be truly embedded throughout the operational form. As a result, when these items are placed as operational items, they would not shift systematically in one direction or the other. To simulate this, after calculating the amount of item position change, the sign of this change was randomly assigned. The second field test design assumes that, in this passage-based test, all field test items are placed as the last item set on the operational form. As a result, when items are placed for operational use, any shifts in item position will be unidirectional. To simulate this condition, the item position changes generated were all left as positive numbers.

Sample Size: The last test-level condition in the simulation used three different sample size combinations. All three sample size conditions were selected so as to introduce error into the simulations reflective of the amount of estimation error that would occur through the calibration of the items under different sample sizes. The three levels, each consisting of a field-test and an operational sample size, were 500 (FT)/2,000 (OP) (small), 1,000/5,000 (medium), and 2,000/10,000 (large).
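As a rough illustration of how calibration error that depends on sample size can be injected into true item parameters, the sketch below accumulates item information over 41 standard normal quadrature points and draws normal error with variance equal to the inverse of that information. The diagonal Fisher-information approximation and the specific functional form are assumptions made for this sketch only; the study's own error-generating equations (Equations 3 and 4 below) are not reproduced in this text.

```python
import numpy as np

D = 1.7  # logistic scaling constant


def p3pl(theta, a, b, c):
    """3PL probability of a correct response at each ability value in theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))


def perturbed_estimates(a, b, c, n_examinees, rng):
    """Return (a_hat, b_hat): true parameters plus normally distributed
    calibration error whose size shrinks as the calibration sample grows."""
    theta = np.linspace(-4.0, 4.0, 41)          # 41 evenly spaced quadrature points
    weights = np.exp(-0.5 * theta**2)
    weights /= weights.sum()                    # expected proportion of examinees
    counts = n_examinees * weights              # expected count at each point

    p = p3pl(theta, a, b, c)
    p_star = (p - c) / (1.0 - c)                # 2PL component of the curve
    dp_da = (1.0 - c) * D * (theta - b) * p_star * (1.0 - p_star)
    dp_db = -(1.0 - c) * D * a * p_star * (1.0 - p_star)

    info_a = np.sum(counts * dp_da**2 / (p * (1.0 - p)))   # I(a), diagonal approx.
    info_b = np.sum(counts * dp_db**2 / (p * (1.0 - p)))   # I(b), diagonal approx.

    a_hat = a + rng.normal(0.0, 1.0 / np.sqrt(info_a))
    b_hat = b + rng.normal(0.0, 1.0 / np.sqrt(info_b))
    return a_hat, b_hat


rng = np.random.default_rng(2012)
print(perturbed_estimates(a=1.1, b=0.3, c=0.2, n_examinees=500, rng=rng))
print(perturbed_estimates(a=1.1, b=0.3, c=0.2, n_examinees=10000, rng=rng))
```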

Using IRT item parameter estimates from a single reading assessment as the true item parameters for the simulations, the following general steps were repeated 100 times for each combination of conditions:

Step one: Create a set of field-test item parameter estimates by adding estimation-error components, computed using Equations 3 and 4 below, to the true a- and b-parameters, assuming a field-test sample size.

(3)
(4)

where the information functions I(a) and I(b) are calculated over 41 evenly spaced quadrature points between -4.0 and 4.0 and weighted, assuming a standard normal distribution, by the expected proportion of students at each point given the sample size.

Step two: Construct a test by randomly selecting a total of 50 items from the field-test items created in step one.

Step three: Create a set of operational item parameter estimates by first defining estimation-error components, using Equations 3 and 4, and adding them to the true a- and b-parameters, assuming an operational sample size.

Then, using Equation 5 or 6 (see footnote 1) and the values for each experimental condition, the newly calculated value was entered into Equation 1 to create the final form item parameter estimates. The final operational parameter values for each item are created by adding the true parameter values and the two error components computed in the steps above.

(5)
(6)

Step four: Estimate the Stocking and Lord scaling constants required to place the operational parameter values on the same scale as the field-test items, and then rescale the operational parameter values. For each condition in the study, the mean and standard deviation of both constants, the average difference between the field-test and rescaled a- and b-parameters, and the average D² values are collected.

Step five: Using the actual scale score transformation constants and cut scores from the same test that was used to supply the true parameters, create two raw-score-to-scale-score (RSSS) tables that include performance level classifications by applying the cut scores associated with the test that supplied the true item parameter estimates. One table is based on the field-test item parameters and the second table is based on the rescaled operational item parameters.

Footnote 1: These equations use the regression intercepts and coefficients estimated using only the grades 3 through 8 reading assessments for a single state, as described earlier in this section.

Within each condition, measures reflecting differences in scale scores (i.e., BIAS and RMSD) and differences in classification are collected.

Simulation Study Results

D²

Table 13 presents mean D² statistics across items and replications for the study conditions. The results indicate that D² values were most heavily affected by item position change, with average values nearly seven times as large in the major shifts conditions as in the otherwise similar conditions with minor shifts. The field-test design condition had a smaller impact on the magnitude of D², with the embedded condition producing uniformly smaller values across all conditions. D² did not appear to be heavily impacted by sample size, with the small, medium, and large sample size conditions yielding similar average values. Elapsed time did have a noticeable impact on the D² values when comparing the short gap and long gap conditions. The number of previous uses also had an observable effect on D²; although the difference between the light and heavy re-use conditions was extremely small in magnitude, one might have expected the opposite pattern to emerge.

Stocking and Lord Constants

Tables 14a and 14b present the average Stocking and Lord multiplicative (A) and additive (B) constants across the 100 iterations. Keep in mind that under conditions where the tests do not shift in difficulty and the populations are equivalent, these values would be A = 1 and B = 0.

While tests were not created to be strictly parallel, the items for each iteration were selected systematically to be very similar and in such a way that each form would not be systematically easier or harder. Thus, comparing to the baseline values of A = 1 and B = 0 gives a reasonable frame of reference when interpreting the results of the simulation study. As was observed for the other measures of interest, IPC and the field test design had a dramatic impact on the magnitude and distribution of the scaling constants (particularly the additive constant), and sample size had a negligible impact on the magnitude and distribution of the constants. Figures 1 and 2 show the distribution of the constants across each replication of the study, collapsed over sample size conditions. Across all conditions, the constants for the embedded field-test design were closer to the benchmark values of 1.0 and 0.0 than were the values of the constants in the fixed field-test condition. This is particularly noticeable for the additive constants under the major shifts IPC condition. However, the spread of the constants across replications was markedly larger under the embedded field-test design when compared to the fixed field-test design. Elapsed time had a small impact on the Stocking and Lord constant values, with the mean A and B values differing only slightly between the short gap and long gap conditions. Number of previous uses also had a minor impact on the additive scaling constant, but little impact on the multiplicative constant, with the light and heavy re-use conditions differing only slightly in their average A and B values.
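For readers less familiar with the procedure, the sketch below shows the Stocking and Lord criterion as it is commonly described: find the multiplicative (A) and additive (B) constants that minimize the weighted squared difference between the base-scale test characteristic curve and the curve computed from the transformed new parameters. The quadrature grid, weights, and optimizer are illustrative choices for this sketch, not the operational implementation used by these programs.

```python
import numpy as np
from scipy.optimize import minimize

D = 1.7  # logistic scaling constant


def tcc(theta, a, b, c):
    """Test characteristic curve: expected number-correct score at each theta."""
    z = D * a[None, :] * (theta[:, None] - b[None, :])
    p = c[None, :] + (1.0 - c[None, :]) / (1.0 + np.exp(-z))
    return p.sum(axis=1)


def stocking_lord(a_new, b_new, c_new, a_old, b_old, c_old):
    """Estimate A and B so that a_new/A and A*b_new + B line up with the old scale."""
    theta = np.linspace(-4.0, 4.0, 41)
    w = np.exp(-0.5 * theta**2)
    w /= w.sum()                              # assumed standard normal weights

    def loss(x):
        A, B = x
        diff = tcc(theta, a_old, b_old, c_old) - tcc(theta, a_new / A, A * b_new + B, c_new)
        return float(np.sum(w * diff**2))

    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x


# Toy example: the "new" calibration is the old one with difficulties shifted by +0.5,
# so the constants should come back as A near 1.0 and B near -0.5.
a = np.array([0.8, 1.0, 1.2])
b = np.array([-0.5, 0.0, 0.5])
c = np.array([0.2, 0.2, 0.2])
print(stocking_lord(a, b + 0.5, c, a, b, c))
```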

Scale Score Shifts: BIAS and RMSD

Tables 15a and 15b present the average scale scores by study condition for scoring tables generated using the field test parameters and the operational test parameters for the two different field test designs. In addition, and perhaps more interesting, for each condition the average difference between the field test and operational scale scores (BIAS) and the root mean squared difference (RMSD) are provided. Both BIAS and RMSD were calculated using the differences between the scale scores associated with each possible raw score that were produced using the field-test or rescaled operational item parameters. An estimate of BIAS (the simple mean difference between scale scores at each raw score point) and RMSD (the square root of the average squared difference) was produced for each replication and then aggregated across replications. As was observed for D², sample size did not play a large role in the size of the observed scale score shifts: BIAS did decrease as sample size increased, though the RMSD was essentially unchanged across all three sample size conditions. The magnitude of the observed BIAS was most heavily influenced by changes in item position, though, as one might expect, the unidirectional changes in item position simulated in the fixed field-test design condition systematically created errors in the same direction. BIAS under this condition was always negative and increased in magnitude as item position change and time since last administration increased. Observed BIAS under the embedded field-test condition was less extreme than under the fixed field-test design. In general, when item position change was minor, the BIAS under the embedded condition tended to be small and negative. When the shifts in item position were large, BIAS in the embedded conditions increased in magnitude and was positive.
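A minimal sketch of these two summary measures, computed from a pair of hypothetical raw-score-to-scale-score tables, is given below; the toy tables and the sign convention (field test minus operational) are assumptions made only for illustration.

```python
import numpy as np


def bias_and_rmsd(rsss_fieldtest, rsss_operational):
    """BIAS and RMSD between two raw-score-to-scale-score (RSSS) tables.

    Each argument is an array of scale scores indexed by raw score
    (51 values for a 50-item test)."""
    diff = np.asarray(rsss_fieldtest, float) - np.asarray(rsss_operational, float)
    return float(diff.mean()), float(np.sqrt(np.mean(diff**2)))


# Toy example: an operational table shifted down 1.5 scale score points everywhere.
raw = np.arange(51)
table_ft = 300.0 + 4.0 * raw
table_op = table_ft - 1.5
print(bias_and_rmsd(table_ft, table_op))   # -> (1.5, 1.5)
```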

Figure 3 displays the magnitude of RMSD for the combinations of experimental conditions collapsed over sample sizes. RMSD was very large under the large shift item position change condition across all levels of all other conditions. The embedded field test condition showed a slightly smaller RMSD except when the item position change was large and the gap between administrations was short. Similar to what was observed with D², the magnitude of RMSD did increase, though to a lesser degree, when the gap between administrations was longer.

Achievement Level Changes

Table 16 presents the achievement level change for each condition, designated as the number and percent of scale scores that remained assigned to the same achievement level after equating (change = 0) as well as the number that moved down one (-1) or up one (+1) achievement level after equating. The vast majority of scale scores (51 per test * 100 iterations) maintained their categorization after equating across all conditions. Although the difference was still fairly minor, change in item position had the largest observed impact on the number of scale scores maintaining their association with an achievement level. For the minor shift conditions, 99.33% remained unchanged compared to 98.01% remaining unchanged in the major shift conditions. The percent of scale scores showing a change in performance level was 0.66 for the minor shift conditions compared to 1.99 for the major shift conditions. Elapsed time had a negligibly small observed impact, with an average of 98.96% of the scores maintaining their relationship with achievement levels in the short gap conditions compared to 98.75% in the long gap conditions, though it should be mentioned that the highest rate of performance level changes occurred when the long gap condition was coupled with large shifts on the IPC condition (where 2.83% of scale scores changed performance levels).

Finally, the number of previous uses had a negligible impact on these classifications, with mean values of 98.28% for the light re-use conditions and 98.66% for the heavy re-use conditions. The two field test design conditions performed very similarly in terms of changes in performance classification, with the embedded condition performing slightly, but systematically, better than the fixed condition. As seen in all other comparisons, sample size had virtually no impact on these classifications, with average values of 98.44%, 98.38%, and 98.58% for the small, medium, and large sample size conditions, respectively. While the numbers of changes in performance levels, which are the differences that would most matter to students, were notably small across all conditions, a better reflection of the impact on performance classification should focus on the shifts only at the cut scores rather than at all the scale score points. The Cuts Change column in Table 16 contains the percent of cut scores (i.e., the raw scores associated with the scale score cut-points), across all iterations, that changed. The general patterns described above remain unchanged (i.e., large shifts in item position are associated with the largest percent of cut scores that change, with time and usage having observable but smaller impacts on cut score changes), with one exception: the differences between the two field-test designs become much more evident, with the embedded condition outperforming the fixed condition.

Discussion and Conclusion

This study was conducted to add to the body of existing research in the area of item position effects by directly extending the Meyers et al. (2009) study. The original study evaluated a specific testing situation in which, first, the Rasch model was used. Second, the whole test served as a common item set to derive an equating constant representing the mean difference in b-values between field test and operational test items. Third, all (or nearly all) of the items placed on an operational test were field tested the previous year.

Fourth, items were never re-used operationally due to item disclosure requirements. To build on that body of research and assess the generalizability of the original results, the current study modeled a testing situation in which the 3PL model is used, an internal common item anchor set and the Stocking and Lord procedure are employed to equate operational test forms, items are occasionally to frequently re-used across administrations, and more than one year (testing cycle) occasionally to frequently passes between one test occurrence (either field test or operational test) and the current administration. These selected variables are commonly confronted in operating large-scale testing programs, and decisions are often required due to bank limitations, blueprint updates, and competing policies.

Prior to discussing the key findings, there are a few potential limitations worth noting. First, the simulations were based on a real-world worst-case scenario in large-scale assessment. While this optimizes the likelihood of finding effects if they were to exist, caution must be taken when interpreting these results in light of different testing conditions. Second, the real-world analyses, while extending previous research, only included data from a very limited sample of testing programs with an established set of test administration policies. Consequently, the actual policies of different testing programs could be more or less stringent than the policies of the programs in this study. The result could be to reduce or expand the relationships found in the real-world data and thereby impact the results of the simulations.

Despite these limitations, the results of this study have important practical and research implications. In general, the main findings from the 2009 study were replicated here. Results of both the real data analysis and the simulation study indicated that large changes in item positions between administrations can have a dramatic impact on both the observed item parameters and the subsequent equating results.

In the real data analysis, both the item location (b-parameters) and discrimination (a-parameters) were negatively impacted by item position change. In addition, D² values were larger when shifts in position between administrations were large. This indicates that the distribution and magnitude of a set of D² statistics could serve as a useful indicator of when factors like item position change or over-use of items may be adversely impacting scaling and equating. The results of the simulation study bore out the results from the real data analysis, while shedding additional light on the practical implications of the observed patterns. Item position change affected every measure evaluated. Not only were D² values dramatically larger in the large shift conditions, but as a result, the derived scale scores differed more from their pre-equated values at each raw score point. Ultimately, this resulted in situations where a large percent of the cut scores used to assign students to performance levels shifted. It is important to note that the field test design (embedded versus field testing in a single block at the end of an assessment) also played a contributing role in the magnitude of the changes described above. An embedded field test design, which leads to item position changes that are both positive and negative, was better able to mitigate the impact even when these item position changes were large. Furthermore, both the A and B Stocking and Lord constants were greater in magnitude when large shifts in item position were simulated, indicating that a larger adjustment would be needed to bring the operational items onto the existing measurement scale. The scaling constants were also directly impacted by the choice of field-test design.

While large shifts in item position led to an additive constant far from the expected value of zero under both field-test designs, the embedded field test produced constants that were closer, on average, to the expected values of the scaling constants, but far more variable than the highly biased but consistent values produced under the fixed field-test design. In practical terms, when items changed dramatically in their locations between administrations, the resulting achievement levels associated with each scale score changed from their pre-equated values. In addition to the impact of change in item position, the simulation study investigated the impact of item usage and elapsed time between administrations. In general, small impacts on D², the Stocking and Lord constants, and scale scores were observed for both elapsed time and item re-use.

From a research perspective, the findings of this study point to the continued importance of studying variables that impact the stability of items over time. The simulation results suggest that item position change contributes to instability of item parameter estimates regardless of the conditions studied and that, when possible, an embedded field test design should be employed. As seen in this investigation, allowing items to shift too far from their prior use, and worse, allowing those shifts to occur in a single direction as in the fixed field test design condition, can result in errors in the classification of students into performance levels. For high-stakes assessments this is a significant research finding, as any misclassification of students, particularly at the pass/fail cut score, can have profound consequences. For test developers, this research suggests that small item position shifts are allowable; however, large item position shifts should be avoided or prohibited. Such a restriction must then be considered during test design and field testing.

The outcomes from this research point to important future directions. First, it is important to continue to expand the research paradigm to reflect the policies of different large-scale testing programs and how these policies drive decisions with regard to the variables studied here.

Relaxed policies regarding IPC, usage, and time may increase the size of the initial regression coefficients, including the variance accounted for in parameter estimate changes, which may result in larger Stocking and Lord adjustments, greater BIAS, and more error. In addition to surveying more large-scale programs to build the initial regression, it is plausible that there are other variables that impact changes in parameters and equating results. For additional variables to be considered, it would be essential that they account for incremental variance not accounted for by the current variables.

In summary, as suggested by prior research, this study confirmed that changing item positions between administrations has a negative impact on the measurement properties of a test. When items shift substantially, some percentage of students can be classified into the wrong achievement level. In high-stakes testing in particular, like the assessment programs modeled here, this poses a large risk. Test developers are encouraged to keep items in the same or similar positions across administrations.

References

Kingston, N. M., & Dorans, N. J. (1984). Item location effects and their implications for IRT equating and adaptive testing. Applied Psychological Measurement, 8.

Klein, S., & Bolus, R. (1983). The effect of item sequence on bar examination scores. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Quebec.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.

Kolen, M., & Harris, D. (1990). Comparison of item pre-equating and random groups equating using IRT and equipercentile methods. Journal of Educational Measurement, 27(1).

Leary, L., & Dorans, N. (1982). The effects of item rearrangement on test performance: A review of the literature (ETS Research Report RR-82-30). Princeton, NJ: Educational Testing Service.

Leary, L., & Dorans, N. (1985). Implications for altering the context in which test items appear: A historical perspective on an immediate concern. Review of Educational Research, 55.

Meyers, J. L., Miller, G. E., & Way, W. D. (2009). Item position and item difficulty change in an IRT-based common equating design. Applied Measurement in Education, 22(1).

Murphy, S., Little, I., Kirkpatrick, R., Fan, M., & Lin, C. H. (2010). The impact of different anchor stability methods on equating results and student performance. Paper presented at the annual conference of the National Council on Measurement in Education, Denver, CO.

Rubin, L., & Mott, D. (1984). The effect of the position of an item within a test on the item difficulty value. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7.

Whitely, E., & Dawis, R. (1976). The influence of test context on item difficulty. Educational and Psychological Measurement, 36.

Yen, W. M. (1980). The extent, causes and importance of context effects on item parameters for two latent trait models. Journal of Educational Measurement, 17.

25 Table 1. Tests and number of items included in analysis. State Subject Grade N SE State Math 3 45 SE State Math 4 45 SE State Math 5 50 SE State Math 6 50 SE State Math 7 50 SE State Math 8 50 SE State Reading 3 50 SE State Reading 4 50 SE State Reading 5 60 SE State Reading 6 60 SE State Reading 7 70 SE State Reading 8 70 SE State ALG SE State BIO SE State ENG SW State Math 3 50 SW State Math 4 50 SW State Math 5 50 SW State Math 6 47 SW State Math 7 50 SW State Math 8 50 SW State Math 8 45 SW State Reading 3 50 SW State Reading 4 50 SW State Reading 5 50 SW State Reading 6 50 SW State Reading 7 50 SW State Reading 8 50 SW State Reading 8 45 SW State Science 5 45 SW State Social Studies 5 60 SW State Social Studies 7 45 SW State ALG SW State bio SW State ENG SW State geo SW State HIST

26 Table 2. Descriptive statistics for b-parameters as appearing on field tests and operational tests Field Test bs Live Test bs N Mean STD Min Max Mean STD Min Max State Subject Grade SE Math SE Math SE Math SE Math SE Math SE Math SE Reading SE Reading SE Reading SE Reading SE Reading SE Reading SE ALG SE BIO SE ENG SW Math SW Math SW Math SW Math SW Math SW Math SW Reading SW Reading SW Reading SW Reading SW Reading SW Reading SW Science SW Science SW Social Studies SW Social Studies SW Social Studies SW ALG SW BIO SW ENG SW GEO SW HIST

27 Table 3. Descriptive statistics for a-parameters as appearing on field tests and operational tests Field Test as Live Test Test as N Mean STD Min Max Mean STD Min Max State Subject Grade SE Math SE Math SE Math SE Math SE Math SE Math SE Reading SE Reading SE Reading SE Reading SE Reading SE Reading SE ALG SE BIO SE ENG SW Math SW Math SW Math SW Math SW Math SW Math SW Reading SW Reading SW Reading SW Reading SW Reading SW Reading SW Science SW Science SW Social Studies SW Social Studies SW Social Studies SW ALG SW BIO SW ENG SW GEO SW HIST

28 Table 4. Distribution of item position changes between test administrations IPC Frequency Percent Greater than Between -15 and Between -10 and Between -5 and Between 0 and Between 0 and Between 5 and Between 10 and Between 15 and Greater than

29 Table 5. Change in item discrimination as a function of item position change IPC N Mean change in a SD Min Max Greater than Between -15 and Between -10 and Between -5 and Between 0 and Between 0 and Between 5 and Between 10 and Between 15 and Greater than

30 Table 6. Change in Item difficulty as a function of change in item position IPC N Mean Change in b SD Min Max Greater than Between -15 and Between -10 and Between -5 and Between 0 and Between 0 and Between 5 and Between 10 and Between 15 and Greater than

31 Table 7. Change in D 2 as a function of change in item position IPC N Mean Change in D2 SD Minimum Maximum Greater than Between -15 and Between -10 and Between -5 and Between 0 and Between 0 and Between 5 and Between 10 and Between 15 and Greater than

32 Table 8. Elapsed time between operational administrations by test Time Elapsed Since Last Use N Test Subject Gr N % N % N % N % N % N % N % N % SE Math % % % % % % % % SE Math % % % % % % % % SE Math % % % % % % % % SE Math % % % % % % % % SE Math % % % % % % % % SE Math % % % % % % % % SE Reading % % % % % % % % SE Reading % % % % % % % % SE Reading % % % % % % % % SE Reading % % % % % % % % SE Reading % % % % % % % % SE Reading % % % % % % % % SE ALG % % % % % % % % SE BIO % % % % % % % % SE ENG % % % % % % % % SW Math % % % % % % % % SW Math % % % % % % % % SW Math % % % % % % % % SW Math % % % % % % % % SW Math % % % % % % % % SW Math % % % % % % % % SW Reading % % % % % % % % SW Reading % % % % % % % % SW Reading % % % % % % % % SW Reading % % % % % % % % SW Reading % % % % % % % % SW Reading % % % % % % % % SW Science % % % % % % % % SW Science % % % % % % % % SW Soc % % % % % % % % Stud. SW Soc % % % % % % % % Stud. SW Soc % % % % % % % % Stud. SW BIO % % % % % % % % SW ENG % % % % % % % % SW GEO % % % % % % % % SW HIST % % % % % % % % SW HIST % % % % % % % % Overall % % % % % % % % 31

33 Table 9. Number of previous operational uses by test Number of Previous Uses N Test Subject Gr N % N % N % N % N % N % N % SE Math % % % % % % % SE Math % % % % % % % SE Math % % % % % % % SE Math % % % % % % % SE Math % % % % % % % SE Math % % % % % % % SE Reading % % % % % % % SE Reading % % % % % % % SE Reading % % % % % % % SE Reading % % % % % % % SE Reading % % % % % % % SE Reading % % % % % % % SE ALG % % % % % % % SE BIO % % % % % % % SE ENG % % % % % % % SW Math % % % % % % % SW Math % % % % % % % SW Math % % % % % % % SW Math % % % % % % % SW Math % % % % % % % SW Math % % % % % % % SW Reading % % % % % % % SW Reading % % % % % % % SW Reading % % % % % % % SW Reading % % % % % % % SW Reading % % % % % % % SW Reading % % % % % % % SW Science % % % % % % % SW Science % % % % % % % SW Soc % % % % % % % Stud. SW Soc % % % % % % % Stud. SW Soc % % % % % % % Stud. SW BIO % % % % % % % SW ENG % % % % % % % SW GEO % % % % % % % SW HIST % % % % % % % SW HIST % % % % % % % OVERALL % % % % % % % 32

34 Table 10. Impact of item position change, elapsed time between administrations, and number of previous uses on item discrimination regressions on a parameter Test Subject Grade Intercept IPC TIME USES Rsquared Overall SIG SE Math ns SE Math ns SE Math sig SE Math ns SE Math ns SE Math ns SE Reading sig SE Reading ns SE Reading ns SE Reading sig SE Reading ns SE Reading sig SE ALG ns SE BIO ns SE ENG ns SW Math ns SW Math ns SW Math ns SW Math sig SW Math sig SW Math ns SW Reading sig SW Reading sig SW Reading ns SW Reading sig SW Reading sig SW Reading sig SW Science ns SW Science ns SW Social Studies ns SW Social Studies ns SW Social Studies ns SW ALG sig SW bio sig SW ENG ns SW geo sig SW HIST ns 33

Table 11. Impact of item position change, elapsed time between administrations, and number of previous uses on item difficulty (regressions on b-parameter)

36 Table 12. Impact of item position change, elapsed time between administrations, and number of previous uses on D 2 regressions on d-squared Test Subject Grade Intercept IPC TIME USES Rsquared Overall SIG SE Math ns SE Math sig SE Math sig SE Math sig SE Math sig SE Math sig SE Reading sig SE Reading sig SE Reading sig SE Reading ns SE Reading ns SE Reading ns SE ALG sig SE BIO sig SE ENG sig SW Math ns SW Math ns SW Math ns SW Math ns SW Math ns SW Math ns SW Reading sig SW Reading sig SW Reading sig SW Reading sig SW Reading ns SW Reading ns SW Science ns SW Science ns SW Social sig Studies SW Social ns Studies SW Social ns Studies SW ALG ns SW bio ns SW ENG ns SW geo ns SW HIST sig 35

37 Table 13. Impact of simulation study conditions on D 2 values Sample Time Use Position Fixed FT Embedded FT M SD MIN MAX M SD MIN MAX 500/ / / / / / / / / / / / / / / / / / / / / / / /

38 Table 14a. Stocking and Lord multiplicative constants by study condition Sample Size Time Usage IPC Fixed FT Embedded FT M SD MIN MAX M SD MIN MAX 500/ / / / / / / / / / / / / / / / / / / / / / / /

39 Table 14b. Stocking and Lord additive constants by study condition Sample Size Time Usage IPC Fixed FT Embedded FT M SD MIN MAX M SD MIN MAX 500/ / / / / / / / / / / / / / / / / / / / / / / /

40 Table 15a. Pre-Equated and post-equated scale score descriptive statistics by study condition Sample FT Design Time Usage IPC FT M SD MIN MAX M SD MIN MAX 500/2000 Fixed /2000 Fixed /2000 Fixed /2000 Fixed /2000 Fixed /2000 Fixed /2000 Fixed /2000 Fixed /5000 Fixed /5000 Fixed /5000 Fixed /5000 Fixed /5000 Fixed /5000 Fixed /5000 Fixed /5000 Fixed /10000 Fixed /10000 Fixed /10000 Fixed /10000 Fixed /10000 Fixed /10000 Fixed /10000 Fixed /10000 Fixed OP BIAS RMSD 39

41 Table 15b. Pre-Equated and post-equated scaled score descriptive statistics by study condition Sample FT Design Time Usage IPC FT M SD MIN MAX M SD MIN MAX 500/2000 Embedded /2000 Embedded /2000 Embedded /2000 Embedded /2000 Embedded /2000 Embedded /2000 Embedded /2000 Embedded /5000 Embedded /5000 Embedded /5000 Embedded /5000 Embedded /5000 Embedded /5000 Embedded /5000 Embedded /5000 Embedded /10000 Embedded /10000 Embedded /10000 Embedded /10000 Embedded /10000 Embedded /10000 Embedded /10000 Embedded /10000 Embedded OP BIAS RMSD 40

42 Table 16. Achievement level changes by study conditions Sample Time Usage IPC Fixed FT Design Cuts Change Embedded FT Design Cuts Change 500/ / / / / / / / / / / / / / / / / / / / / / / /



More information

Industrial Engineering Prof. Inderdeep Singh Department of Mechanical & Industrial Engineering Indian Institute of Technology, Roorkee

Industrial Engineering Prof. Inderdeep Singh Department of Mechanical & Industrial Engineering Indian Institute of Technology, Roorkee Industrial Engineering Prof. Inderdeep Singh Department of Mechanical & Industrial Engineering Indian Institute of Technology, Roorkee Module - 04 Lecture - 05 Sales Forecasting - II A very warm welcome

More information

Simple Linear Regression: One Quantitative IV

Simple Linear Regression: One Quantitative IV Simple Linear Regression: One Quantitative IV Linear regression is frequently used to explain variation observed in a dependent variable (DV) with theoretically linked independent variables (IV). For example,

More information

A White Paper on Scaling PARCC Assessments: Some Considerations and a Synthetic Data Example

A White Paper on Scaling PARCC Assessments: Some Considerations and a Synthetic Data Example A White Paper on Scaling PARCC Assessments: Some Considerations and a Synthetic Data Example Robert L. Brennan CASMA University of Iowa June 10, 2012 On May 3, 2012, the author made a PowerPoint presentation

More information

Alternative Growth Goals for Students Attending Alternative Education Campuses

Alternative Growth Goals for Students Attending Alternative Education Campuses Alternative Growth Goals for Students Attending Alternative Education Campuses AN ANALYSIS OF NWEA S MAP ASSESSMENT: TECHNICAL REPORT Jody L. Ernst, Ph.D. Director of Research & Evaluation Colorado League

More information

Item Response Theory and Computerized Adaptive Testing

Item Response Theory and Computerized Adaptive Testing Item Response Theory and Computerized Adaptive Testing Richard C. Gershon, PhD Department of Medical Social Sciences Feinberg School of Medicine Northwestern University gershon@northwestern.edu May 20,

More information

Introduction To Confirmatory Factor Analysis and Item Response Theory

Introduction To Confirmatory Factor Analysis and Item Response Theory Introduction To Confirmatory Factor Analysis and Item Response Theory Lecture 23 May 3, 2005 Applied Regression Analysis Lecture #23-5/3/2005 Slide 1 of 21 Today s Lecture Confirmatory Factor Analysis.

More information

A Use of the Information Function in Tailored Testing

A Use of the Information Function in Tailored Testing A Use of the Information Function in Tailored Testing Fumiko Samejima University of Tennessee for indi- Several important and useful implications in latent trait theory, with direct implications vidualized

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 23 Comparison of Three IRT Linking Procedures in the Random Groups Equating Design Won-Chan Lee Jae-Chun Ban February

More information

Observed-Score "Equatings"

Observed-Score Equatings Comparison of IRT True-Score and Equipercentile Observed-Score "Equatings" Frederic M. Lord and Marilyn S. Wingersky Educational Testing Service Two methods of equating tests are compared, one using true

More information

Curriculum Guide Cover Page

Curriculum Guide Cover Page Curriculum Guide Cover Page Course Title: Pre-Algebra Grade Level: 8 th Grade Subject/Topic Area: Math Written by: Jason Hansen Revised Date: November 2013 Time Frame: One School Year (170 days) School

More information

Assessing the relation between language comprehension and performance in general chemistry. Appendices

Assessing the relation between language comprehension and performance in general chemistry. Appendices Assessing the relation between language comprehension and performance in general chemistry Daniel T. Pyburn a, Samuel Pazicni* a, Victor A. Benassi b, and Elizabeth E. Tappin c a Department of Chemistry,

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Keppel, G. & Wickens, T.D. Design and Analysis Chapter 2: Sources of Variability and Sums of Squares

Keppel, G. & Wickens, T.D. Design and Analysis Chapter 2: Sources of Variability and Sums of Squares Keppel, G. & Wickens, T.D. Design and Analysis Chapter 2: Sources of Variability and Sums of Squares K&W introduce the notion of a simple experiment with two conditions. Note that the raw data (p. 16)

More information

Equating of Subscores and Weighted Averages Under the NEAT Design

Equating of Subscores and Weighted Averages Under the NEAT Design Research Report ETS RR 11-01 Equating of Subscores and Weighted Averages Under the NEAT Design Sandip Sinharay Shelby Haberman January 2011 Equating of Subscores and Weighted Averages Under the NEAT Design

More information

A Simulation Study to Compare CAT Strategies for Cognitive Diagnosis

A Simulation Study to Compare CAT Strategies for Cognitive Diagnosis A Simulation Study to Compare CAT Strategies for Cognitive Diagnosis Xueli Xu Department of Statistics,University of Illinois Hua-Hua Chang Department of Educational Psychology,University of Texas Jeff

More information

8. AN EVALUATION OF THE URBAN SYSTEMIC INITIATIVE AND OTHER ACADEMIC REFORMS IN TEXAS: STATISTICAL MODELS FOR ANALYZING LARGE-SCALE DATA SETS

8. AN EVALUATION OF THE URBAN SYSTEMIC INITIATIVE AND OTHER ACADEMIC REFORMS IN TEXAS: STATISTICAL MODELS FOR ANALYZING LARGE-SCALE DATA SETS 8. AN EVALUATION OF THE URBAN SYSTEMIC INITIATIVE AND OTHER ACADEMIC REFORMS IN TEXAS: STATISTICAL MODELS FOR ANALYZING LARGE-SCALE DATA SETS Robert H. Meyer Executive Summary A multidisciplinary team

More information

Annual Performance Report: State Assessment Data

Annual Performance Report: State Assessment Data Annual Performance Report: 2005-2006 State Assessment Data Summary Prepared by: Martha Thurlow, Jason Altman, Damien Cormier, and Ross Moen National Center on Educational Outcomes (NCEO) April, 2008 The

More information

AC : STATISTICAL PROCESS CONTROL LABORATORY EXERCISES FOR ALL ENGINEERING DISCIPLINES

AC : STATISTICAL PROCESS CONTROL LABORATORY EXERCISES FOR ALL ENGINEERING DISCIPLINES AC 2008-1675: STATISTICAL PROCESS CONTROL LABORATORY EXERCISES FOR ALL ENGINEERING DISCIPLINES Jeremy VanAntwerp, Calvin College Richard Braatz, University of Illinois at Urbana-Champaign American Society

More information

The Factor Analytic Method for Item Calibration under Item Response Theory: A Comparison Study Using Simulated Data

The Factor Analytic Method for Item Calibration under Item Response Theory: A Comparison Study Using Simulated Data Int. Statistical Inst.: Proc. 58th World Statistical Congress, 20, Dublin (Session CPS008) p.6049 The Factor Analytic Method for Item Calibration under Item Response Theory: A Comparison Study Using Simulated

More information

Washington State Test

Washington State Test Technical Report # 1101 easycbm Reading Criterion Related Validity Evidence: Washington State Test 2009-2010 Daniel Anderson Julie Alonzo Gerald Tindal University of Oregon Published by Behavioral Research

More information

The robustness of Rasch true score preequating to violations of model assumptions under equivalent and nonequivalent populations

The robustness of Rasch true score preequating to violations of model assumptions under equivalent and nonequivalent populations University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School 2008 The robustness of Rasch true score preequating to violations of model assumptions under equivalent and

More information

Equating Tests Under The Nominal Response Model Frank B. Baker

Equating Tests Under The Nominal Response Model Frank B. Baker Equating Tests Under The Nominal Response Model Frank B. Baker University of Wisconsin Under item response theory, test equating involves finding the coefficients of a linear transformation of the metric

More information

Weather Second Grade Virginia Standards of Learning 2.6 Assessment Creation Project. Amanda Eclipse

Weather Second Grade Virginia Standards of Learning 2.6 Assessment Creation Project. Amanda Eclipse Weather Second Grade Virginia Standards of Learning 2.6 Assessment Creation Project 1 Amanda Eclipse Overview and Description of Course The goal of Virginia s science standards is for the students to develop

More information

1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016

1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016 AM 1: Advanced Optimization Spring 016 Prof. Yaron Singer Lecture 11 March 3rd 1 Overview In this lecture we will introduce the notion of online convex optimization. This is an extremely useful framework

More information

DO NOT CITE WITHOUT AUTHOR S PERMISSION:

DO NOT CITE WITHOUT AUTHOR S PERMISSION: Study Context & Purpose Prior to the 2013-14 school year a Tennessee policy went into effect such that a teacher s level of effectiveness (LOE) and certification status determined their minimum number

More information

One-Way ANOVA. Some examples of when ANOVA would be appropriate include:

One-Way ANOVA. Some examples of when ANOVA would be appropriate include: One-Way ANOVA 1. Purpose Analysis of variance (ANOVA) is used when one wishes to determine whether two or more groups (e.g., classes A, B, and C) differ on some outcome of interest (e.g., an achievement

More information

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i B. Weaver (24-Mar-2005) Multiple Regression... 1 Chapter 5: Multiple Regression 5.1 Partial and semi-partial correlation Before starting on multiple regression per se, we need to consider the concepts

More information

Rigorous Science - Based on a probability value? The linkage between Popperian science and statistical analysis

Rigorous Science - Based on a probability value? The linkage between Popperian science and statistical analysis /3/26 Rigorous Science - Based on a probability value? The linkage between Popperian science and statistical analysis The Philosophy of science: the scientific Method - from a Popperian perspective Philosophy

More information

Development and Calibration of an Item Response Model. that Incorporates Response Time

Development and Calibration of an Item Response Model. that Incorporates Response Time Development and Calibration of an Item Response Model that Incorporates Response Time Tianyou Wang and Bradley A. Hanson ACT, Inc. Send correspondence to: Tianyou Wang ACT, Inc P.O. Box 168 Iowa City,

More information

4 Grouping Variables in Regression

4 Grouping Variables in Regression 4 Grouping Variables in Regression Qualitative variables as predictors So far, we ve considered two kinds of regression models: 1. A numerical response with a categorical or grouping predictor. Here, we

More information

Rigorous Science - Based on a probability value? The linkage between Popperian science and statistical analysis

Rigorous Science - Based on a probability value? The linkage between Popperian science and statistical analysis Rigorous Science - Based on a probability value? The linkage between Popperian science and statistical analysis The Philosophy of science: the scientific Method - from a Popperian perspective Philosophy

More information

Multidimensional Linking for Tests with Mixed Item Types

Multidimensional Linking for Tests with Mixed Item Types Journal of Educational Measurement Summer 2009, Vol. 46, No. 2, pp. 177 197 Multidimensional Linking for Tests with Mixed Item Types Lihua Yao 1 Defense Manpower Data Center Keith Boughton CTB/McGraw-Hill

More information

Because it might not make a big DIF: Assessing differential test functioning

Because it might not make a big DIF: Assessing differential test functioning Because it might not make a big DIF: Assessing differential test functioning David B. Flora R. Philip Chalmers Alyssa Counsell Department of Psychology, Quantitative Methods Area Differential item functioning

More information

Predicting Retention Rates from Placement Exam Scores

Predicting Retention Rates from Placement Exam Scores Predicting Retention Rates from Placement Exam Scores Dr. Michael S. Pilant, Dept. of Mathematics, Texas A&M University Dr. Robert Hall, Dept. of Ed. Psychology, Texas A&M University Amy Austin, Senior

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Overview In observational and experimental studies, the goal may be to estimate the effect

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

What Rasch did: the mathematical underpinnings of the Rasch model. Alex McKee, PhD. 9th Annual UK Rasch User Group Meeting, 20/03/2015

What Rasch did: the mathematical underpinnings of the Rasch model. Alex McKee, PhD. 9th Annual UK Rasch User Group Meeting, 20/03/2015 What Rasch did: the mathematical underpinnings of the Rasch model. Alex McKee, PhD. 9th Annual UK Rasch User Group Meeting, 20/03/2015 Our course Initial conceptualisation Separation of parameters Specific

More information

A Note on the Choice of an Anchor Test in Equating

A Note on the Choice of an Anchor Test in Equating Research Report ETS RR 12-14 A Note on the Choice of an Anchor Test in Equating Sandip Sinharay Shelby Haberman Paul Holland Charles Lewis September 2012 ETS Research Report Series EIGNOR EXECUTIVE EDITOR

More information

Do not copy, post, or distribute

Do not copy, post, or distribute 14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible

More information

Assessing Studies Based on Multiple Regression

Assessing Studies Based on Multiple Regression Assessing Studies Based on Multiple Regression Outline 1. Internal and External Validity 2. Threats to Internal Validity a. Omitted variable bias b. Functional form misspecification c. Errors-in-variables

More information

1 st Grade LEUSD Learning Targets in Mathematics

1 st Grade LEUSD Learning Targets in Mathematics 1 st Grade LEUSD Learning Targets in Mathematics 8/21/2015 The learning targets below are intended to provide a guide for teachers in determining whether students are exhibiting characteristics of being

More information

A Unified Approach to Linear Equating for the Non-Equivalent Groups Design

A Unified Approach to Linear Equating for the Non-Equivalent Groups Design Research Report A Unified Approach to Linear Equating for the Non-Equivalent Groups Design Alina A. von Davier Nan Kong Research & Development November 003 RR-03-31 A Unified Approach to Linear Equating

More information

AS AN. Prepared by the. ASBOG Committee on Academic Assessment Randy L. Kath, Ph.D., PG, Chairman Richard K. Spruill, Ph.D., PG, Committee Member

AS AN. Prepared by the. ASBOG Committee on Academic Assessment Randy L. Kath, Ph.D., PG, Chairman Richard K. Spruill, Ph.D., PG, Committee Member NATIONAL ASSOCIATION OF STATE BOARDS OF GEOLOGY (ASBOG ) FUNDAMENTALS OF GEOLOGY (FG) EXAMINATION AS AN "ASSESSMENT EXAMINATION" Prepared by the ASBOG Committee on Academic Assessment Randy L. Kath, Ph.D.,

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

An Analysis of Field Test Results for Assessment Items Aligned to the Middle School Topic of Atoms, Molecules, and States of Matter

An Analysis of Field Test Results for Assessment Items Aligned to the Middle School Topic of Atoms, Molecules, and States of Matter An Analysis of Field Test Results for Assessment Items Aligned to the Middle School Topic of Atoms, Molecules, and States of Matter Cari F. Herrmann Abell and George E. DeBoer AAAS Project 2061 NARST Annual

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 41 A Comparative Study of Item Response Theory Item Calibration Methods for the Two Parameter Logistic Model Kyung

More information

TABLE OF CONTENTS INTRODUCTION TO MIXED-EFFECTS MODELS...3

TABLE OF CONTENTS INTRODUCTION TO MIXED-EFFECTS MODELS...3 Table of contents TABLE OF CONTENTS...1 1 INTRODUCTION TO MIXED-EFFECTS MODELS...3 Fixed-effects regression ignoring data clustering...5 Fixed-effects regression including data clustering...1 Fixed-effects

More information

Rigorous Science - Based on a probability value? The linkage between Popperian science and statistical analysis

Rigorous Science - Based on a probability value? The linkage between Popperian science and statistical analysis /9/27 Rigorous Science - Based on a probability value? The linkage between Popperian science and statistical analysis The Philosophy of science: the scientific Method - from a Popperian perspective Philosophy

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

Chapter 2: Tools for Exploring Univariate Data

Chapter 2: Tools for Exploring Univariate Data Stats 11 (Fall 2004) Lecture Note Introduction to Statistical Methods for Business and Economics Instructor: Hongquan Xu Chapter 2: Tools for Exploring Univariate Data Section 2.1: Introduction What is

More information

The Difficulty of Test Items That Measure More Than One Ability

The Difficulty of Test Items That Measure More Than One Ability The Difficulty of Test Items That Measure More Than One Ability Mark D. Reckase The American College Testing Program Many test items require more than one ability to obtain a correct response. This article

More information

Investigation into the use of confidence indicators with calibration

Investigation into the use of confidence indicators with calibration WORKSHOP ON FRONTIERS IN BENCHMARKING TECHNIQUES AND THEIR APPLICATION TO OFFICIAL STATISTICS 7 8 APRIL 2005 Investigation into the use of confidence indicators with calibration Gerard Keogh and Dave Jennings

More information

New York City Scope and Sequence for CMP3

New York City Scope and Sequence for CMP3 New York City Scope and Sequence for CMP3 The following pages contain a high-level scope and sequence for Connected Mathematics 3 and incorporate the State s pre- and poststandards guidance (see http://www.p12.nysed.gov/assessment/math/

More information

Analytics for an Online Retailer: Demand Forecasting and Price Optimization

Analytics for an Online Retailer: Demand Forecasting and Price Optimization Analytics for an Online Retailer: Demand Forecasting and Price Optimization Kris Johnson Ferreira Technology and Operations Management Unit, Harvard Business School, kferreira@hbs.edu Bin Hong Alex Lee

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

Studies on the effect of violations of local independence on scale in Rasch models: The Dichotomous Rasch model

Studies on the effect of violations of local independence on scale in Rasch models: The Dichotomous Rasch model Studies on the effect of violations of local independence on scale in Rasch models Studies on the effect of violations of local independence on scale in Rasch models: The Dichotomous Rasch model Ida Marais

More information

AP Final Review II Exploring Data (20% 30%)

AP Final Review II Exploring Data (20% 30%) AP Final Review II Exploring Data (20% 30%) Quantitative vs Categorical Variables Quantitative variables are numerical values for which arithmetic operations such as means make sense. It is usually a measure

More information

On the errors introduced by the naive Bayes independence assumption

On the errors introduced by the naive Bayes independence assumption On the errors introduced by the naive Bayes independence assumption Author Matthijs de Wachter 3671100 Utrecht University Master Thesis Artificial Intelligence Supervisor Dr. Silja Renooij Department of

More information

Mathematical Notation Math Introduction to Applied Statistics

Mathematical Notation Math Introduction to Applied Statistics Mathematical Notation Math 113 - Introduction to Applied Statistics Name : Use Word or WordPerfect to recreate the following documents. Each article is worth 10 points and should be emailed to the instructor

More information

Interactions and Centering in Regression: MRC09 Salaries for graduate faculty in psychology

Interactions and Centering in Regression: MRC09 Salaries for graduate faculty in psychology Psychology 308c Dale Berger Interactions and Centering in Regression: MRC09 Salaries for graduate faculty in psychology This example illustrates modeling an interaction with centering and transformations.

More information

Propensity Score Matching

Propensity Score Matching Methods James H. Steiger Department of Psychology and Human Development Vanderbilt University Regression Modeling, 2009 Methods 1 Introduction 2 3 4 Introduction Why Match? 5 Definition Methods and In

More information

Investigating Models with Two or Three Categories

Investigating Models with Two or Three Categories Ronald H. Heck and Lynn N. Tabata 1 Investigating Models with Two or Three Categories For the past few weeks we have been working with discriminant analysis. Let s now see what the same sort of model might

More information

Ensemble Rasch Models

Ensemble Rasch Models Ensemble Rasch Models Steven M. Lattanzio II Metamatrics Inc., Durham, NC 27713 email: slattanzio@lexile.com Donald S. Burdick Metamatrics Inc., Durham, NC 27713 email: dburdick@lexile.com A. Jackson Stenner

More information

Introduction to Confirmatory Factor Analysis

Introduction to Confirmatory Factor Analysis Introduction to Confirmatory Factor Analysis Multivariate Methods in Education ERSH 8350 Lecture #12 November 16, 2011 ERSH 8350: Lecture 12 Today s Class An Introduction to: Confirmatory Factor Analysis

More information

Research on Standard Errors of Equating Differences

Research on Standard Errors of Equating Differences Research Report Research on Standard Errors of Equating Differences Tim Moses Wenmin Zhang November 2010 ETS RR-10-25 Listening. Learning. Leading. Research on Standard Errors of Equating Differences Tim

More information

The Effect of Differential Item Functioning on Population Invariance of Item Response Theory True Score Equating

The Effect of Differential Item Functioning on Population Invariance of Item Response Theory True Score Equating University of Miami Scholarly Repository Open Access Dissertations Electronic Theses and Dissertations 2012-04-12 The Effect of Differential Item Functioning on Population Invariance of Item Response Theory

More information

Last week: Sample, population and sampling distributions finished with estimation & confidence intervals

Last week: Sample, population and sampling distributions finished with estimation & confidence intervals Past weeks: Measures of central tendency (mean, mode, median) Measures of dispersion (standard deviation, variance, range, etc). Working with the normal curve Last week: Sample, population and sampling

More information

MATH 1150 Chapter 2 Notation and Terminology

MATH 1150 Chapter 2 Notation and Terminology MATH 1150 Chapter 2 Notation and Terminology Categorical Data The following is a dataset for 30 randomly selected adults in the U.S., showing the values of two categorical variables: whether or not the

More information

Introduction to Factor Analysis

Introduction to Factor Analysis to Factor Analysis Lecture 10 August 2, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #10-8/3/2011 Slide 1 of 55 Today s Lecture Factor Analysis Today s Lecture Exploratory

More information