Ensemble Rasch Models
Steven M. Lattanzio II, MetaMetrics Inc., Durham, NC
Donald S. Burdick, MetaMetrics Inc., Durham, NC
A. Jackson Stenner, MetaMetrics Inc., Durham, NC

July 11,
Author's Footnote: Steven M. Lattanzio II is Research Engineer, MetaMetrics Inc., Durham, NC (slattanzio@lexile.com); Donald S. Burdick is Senior Scientist, MetaMetrics Inc., Durham, NC (dburdick@lexile.com); and A. Jackson Stenner is Chief Executive Officer, MetaMetrics Inc., Durham, NC (jstenner@lexile.com).
Abstract

Rasch Models (RMs) model the probability of success on a test item for a person with a given ability or, more broadly, the probability of a given outcome for an item encountered by an entity with a latent propensity to achieve that outcome. Typically, RMs involve the use of a field test to acquire data for calibrating individual item parameters. If those items are known to come from an ensemble of items with an assumed distribution of difficulties, then it is still possible to calibrate the mean item parameters and latent traits with a modified RM that includes a random term, i.e., a model that accounts for uncertainty in item difficulty. This paper explores the use of a unidimensional RM modified with such a term. The ability to calibrate instruments/items based on an ensemble rather than on individual item difficulties has many benefits, including reduced field-testing costs and applications to autonomously generated test items when the characteristics of the ensemble can reasonably be approximated through theory. The value of such a model is demonstrated via a sensitivity analysis based on simulated data.

Keywords: measurement, item response theory, random effect, ensemble calibration
1. INTRODUCTION

Traditionally, when psychometricians want to determine such things as item difficulty and student ability, they turn to Item Response Theory (IRT) and a psychometric measurement model such as the Rasch Model. Rasch Models have been used in psychometrics and other fields for decades. Generally speaking, the Rasch Model describes the probability of a given outcome for an item encountered by an entity with a given propensity to achieve that outcome. Within the context of IRT, multiple tests composed of dichotomous items can be used to measure a common construct for persons on a psychometric scale. Data from field tests are typically used to estimate the item parameters and person parameters (or latent traits) of a psychometric measurement model like the Rasch Model. A calibration for the test instrument can be derived via the estimated item parameters. Within education, Rasch Models are commonly used to calibrate reading and mathematics achievement tests. To avoid confusion and to be consistent with the literature on Rasch Models, the general terms latent trait and item parameter will be referred to by the more specific and common terms person ability and item difficulty respectively, despite the fact that they are not strictly accurate in some potential applications.

There has been a desire to calibrate person ability and item difficulty when single-instance items are encountered instead of reused items. Single-instance, or one-off, items are desirable since computers can generate them on the fly (Hanlon, Swartz, Stenner, Burdick, and Burdick 2010). Since in practice a computer develops a unique test for each individual, test fraud can be minimized and the cost of test development can be dramatically reduced. Traditional approaches to item development and calibration are expensive, and once items have been used in a live testing cycle they often must be retired and new items written.
However, since only one person may ever take any single item in the continuous measurement model, traditional approaches to item calibration are not useful. This context motivates the need for a model that allows an item to be described as belonging to an ensemble of items where the distribution of item difficulties is assumed
to have certain characteristics, typically normally distributed with a theory-supplied mean and standard deviation. Item calibrations can be sampled from the ensemble and used to compute person abilities for counts correct (measurement outcomes). Each item does not need to be associated with a particular response pattern thanks to the raw-score sufficiency feature of the Rasch Model: if the data fit the Rasch Model, there is no information in the pattern of right/wrong responses about the person ability. Thus, it is immaterial to the calculation of person ability how the item calibrations are assigned to the item responses. The idea of modeling uncertainty in ability is an old one, but the notion of modeling uncertainty in item difficulty is still a relatively recent idea. De Boeck (2008) introduces the concept of random item IRT models and presents a useful taxonomy of IRT models.

2. THE ENSEMBLE RASCH MODEL

2.1 The Rasch Model

Speaking in terms of person ability and item difficulty, the Rasch Model maps the difference between a person's ability and an item's difficulty into a probability of a successful response, where the probability is modeled as a logistic function of that difference. In this paper, person ability, item difficulty, and the difference between them are denoted as θ, β, and η respectively. In other words,

    η = θ − β.    (1)

The mathematical form of the Rasch Model for dichotomous data can be described as follows: Let X = x ∈ {0, 1} be a dichotomous random variable where X = 1 represents a successful response and X = 0 represents an unsuccessful response. The probability of success is

    Pr{X = 1} = exp(θ − β) / (1 + exp(θ − β))    (2)

(Rasch 1960). This can be written more compactly as a function of η in a form known as the Item Characteristic Curve (ICC):

    P(η) = exp(η) / (1 + exp(η)).    (3)
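As a brief numerical aside (ours, not part of the original paper; the function name is illustrative), the ICC in (3) can be evaluated in its equivalent logistic form:

```python
import numpy as np

def rasch_icc(eta):
    """Rasch ICC, Eq. (3): P(eta) = exp(eta) / (1 + exp(eta)),
    written in the equivalent logistic form 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(eta, dtype=float)))
```

Evaluating at η = 0 returns 0.5, and the curve is strictly increasing and symmetric in the sense that P(η) + P(−η) = 1.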
The model is defined such that P(0) = 0.5, i.e., the probability of success when person ability matches item difficulty is 50%. However, it should be noted that while the Rasch Model is shown with θ, β, and subsequently η in units of logits, values can be arbitrarily scaled and shifted to match other units that may be desirable, as long as the shape of the curve is maintained. The scaling and shifting can be described by

    P(η) = exp(aη + b) / (1 + exp(aη + b))    (4)

where a and b are arbitrarily set. While a allows the logit unit to be scaled in magnitude, b allows P(0) to be set to a desired probability of success for equal person ability and item difficulty. For example, if b = 1.1 then P(0) ≈ 0.75. It is also important to note that the Rasch Model has locational indeterminacy, i.e., the only thing that is important is the difference between person ability θ and item difficulty β (otherwise known as η).

2.2 Extension to the Ensemble Rasch Model

When items lacking individual calibrations are taken from an ensemble, it is necessary to include the uncertainty surrounding individual item difficulties in the model. Individual item difficulties are thought to come from a distribution of item difficulties, specifically a normal distribution with a mean β and standard deviation σ. This leads to what can be referred to as the Ensemble Rasch Model, which can be written as

    P(η, σ) = E{ exp(η + ε) / (1 + exp(η + ε)) }    (5)

where E{·} is the expected value operator and ε ~ N(0, σ²). This can also be expressed in the form of an integral,

    P(η, σ) = ∫ (1 / √(2πσ²)) exp(−ε² / (2σ²)) / (1 + exp(−(η + ε))) dε,    (6)

for which there is no known closed-form solution. This results in different ICCs for different values of σ, as shown in Figure 1. The ICC for σ = 0 is the graph of (3).
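Since (6) has no closed form, the ensemble ICC must be evaluated numerically. The following sketch is ours, not the paper's; Gauss-Hermite quadrature is one reasonable choice for an expectation over a normal variable:

```python
import numpy as np

def ensemble_icc(eta, sigma, n_nodes=40):
    """Ensemble Rasch ICC, Eqs. (5)-(6): E{logistic(eta + eps)}, eps ~ N(0, sigma^2).
    The change of variables eps = sqrt(2)*sigma*x turns the normal expectation
    into a Gauss-Hermite weighted sum."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    eps = np.sqrt(2.0) * sigma * x
    vals = 1.0 / (1.0 + np.exp(-(eta + eps)))
    return float(np.sum(w * vals) / np.sqrt(np.pi))
```

For σ = 0 this reduces to (3); larger σ flattens the curve away from η = 0, reproducing the qualitative behavior shown in Figure 1.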
Figure 1: Item Characteristic Curves for the Ensemble Rasch Model.

It should be mentioned that while the stochastic term ε is used here to describe the uncertainty associated with item difficulty, it can also become a catch-all term that includes uncertainty in person ability and mean item difficulty. When using the Ensemble Rasch Model to estimate person ability, calibration is in terms of ensemble parameters rather than individual item parameters. The ensemble parameters implicit in (5) are the ensemble mean β and the standard deviation σ. When the ensemble calibration parameters are specified, the probability P(η, σ) as defined in (5) and (6) becomes a function P(θ) of the ability parameter. Assuming local independence for the sample of items from the ensemble, the likelihood function when R out of L items are answered correctly becomes

    L(θ; R) = [P(θ)]^R [1 − P(θ)]^(L−R).    (7)

It follows that R is a sufficient statistic and the maximum likelihood estimate of θ is the value of θ that maximizes (7).
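Because P(θ) is strictly increasing in θ, maximizing (7) reduces to solving P(θ) = R/L. A sketch of this inversion (our illustration; the bracket [−10, 10] and bisection are arbitrary choices, and an interior raw score 0 < R < L is assumed):

```python
import numpy as np

def ensemble_icc(theta, beta, sigma, n_nodes=40):
    # P(theta - beta, sigma) via Gauss-Hermite quadrature, as in Eq. (6).
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    eps = np.sqrt(2.0) * sigma * x
    return float(np.sum(w / (1.0 + np.exp(-(theta - beta + eps)))) / np.sqrt(np.pi))

def mle_theta(R, L, beta, sigma, lo=-10.0, hi=10.0):
    """MLE of ability from R correct out of L: the theta where P(theta) = R/L,
    found by bisection on the monotone ICC (assumes 0 < R < L)."""
    target = R / L
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if ensemble_icc(mid, beta, sigma) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, a raw score of 20 out of 40 on an ensemble with mean difficulty 0 yields an ability estimate of 0, by the symmetry of the ICC.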
3. SENSITIVITY ANALYSIS

A sensitivity analysis will be conducted in order to determine the utility not only of having a good estimate of the standard deviation of individual item difficulties, but of including uncertainty in individual item difficulties in the first place. This will be done by examining empirical calibrations based on ensemble means (referred to as ensemble calibration), where all of the data are generated through simulation. The effect of a deviation from the true standard deviation is shown in plots of the simulated data and is also given as a value of expected mean square error between the expected raw score based on the true individual item difficulty standard deviation and one based on an assumed value.

3.1 Analytical error

To get an idea how sensitive results based on real data will be to differences between the true and assumed values of standard deviation, σ_T and σ_A respectively, for individual item difficulty parameters, it is possible to derive an expression for the expected mean square error (MSE) between the expected raw scores based on σ_T and σ_A:

    E{MSE} = ∫ (P(η, σ_T) − P(η, σ_A))² p_η(η) dη    (8)

where p_η(η) is the probability density function for η. Note that this equation represents the expected MSE when tests are considered to have an infinite number of items, unlike the simulated results; this was done to make the problem tractable. This expression produces the plots in Figure 2 when p_η(η) = U(−r, r), i.e., each person has an equal likelihood of taking a test with a true mean item difficulty within r logits of their true ability and does not encounter tests outside of that range. Note that as the tests are better targeted, the MSE is less sensitive to differences between σ_T and σ_A.
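Equation (8) with the uniform density p_η = U(−r, r) can be approximated on a grid. A sketch (ours; the grid size and quadrature depth are arbitrary):

```python
import numpy as np

def ensemble_icc(eta, sigma, n_nodes=40):
    # Vectorized ensemble ICC over an array of eta values (Eq. 6, Gauss-Hermite).
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    eps = np.sqrt(2.0) * sigma * x
    vals = 1.0 / (1.0 + np.exp(-(np.asarray(eta, dtype=float)[..., None] + eps)))
    return vals @ w / np.sqrt(np.pi)

def expected_mse(sigma_t, sigma_a, r, n_grid=2001):
    """Eq. (8) with p_eta = U(-r, r): average squared gap between the
    expected-score curves under the true and assumed ensemble SDs."""
    eta = np.linspace(-r, r, n_grid)
    diff = ensemble_icc(eta, sigma_t) - ensemble_icc(eta, sigma_a)
    return float(np.mean(diff ** 2))  # uniform-density average over the grid
```

Sweeping σ_A at a fixed σ_T reproduces the qualitative pattern of Figure 2: zero error at σ_A = σ_T, and greater sensitivity to the mismatch for poorly targeted tests (larger r).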
Figure 2: Expected mean-square error for raw scores for various values of r. a) σ_T = 0.5. b) σ_T = 1.0. c) σ_T = .

3.2 Data simulation

A true value for a measure of ability θ_T,i, i ∈ {1, 2, ..., M}, is determined for M individuals. These are randomly generated from a normal distribution with mean µ_θ and standard deviation σ_θ. A true value for a measure of the mean item difficulty β_T,j, j ∈ {1, 2, ..., N}, is determined for N tests. These are randomly generated from a normal distribution with mean µ_β and standard deviation σ_β. Tests consist of different numbers of items, the jth test having K_j items, where K_j is randomly generated from a normal distribution with mean µ_K and standard deviation σ_K.
Values are rounded to the nearest integer, and all values that fall below a minimum number of items K_min are set equal to K_min. Individual item difficulty β_i,j,k (the difficulty of the kth item on the jth test for the ith person, where k ∈ {1, 2, ..., K_j}) is randomly generated from a normal distribution with mean β_T,j and a true value for standard deviation σ_T that is considered the same for all ensembles.

A history of raw scores is recorded in an array Y, where Y_i,j is the fraction of items correct for individual i on test j. This was determined through simulation: each item on each test for each individual was scored as successful or unsuccessful based on whether a value p, randomly generated from a uniform distribution ranging from 0 to 1 for each item for each individual for each test taken, is less than or equal to the probability of success for an individual with a given ability θ_T,i on an item with a given difficulty β_i,j,k as described by the Rasch Model.

Not all tests were encountered by every individual. Each individual had a unique probability Q_i of taking any given test whose difficulty falls within a range that extends plus or minus r logits from their true ability θ_T,i. Values of Q_i are randomly generated from a uniform distribution from 0 to 1. Y_i,j will be empty if individual i did not encounter test j. An individual was considered to take a test if a value q, randomly generated from a uniform distribution ranging from 0 to 1, is less than or equal to Q_i. Data from individuals who did not encounter at least t_min tests, and tests that were not taken at least e_min times, were ignored.

3.3 Ensemble calibration

Ensemble calibration is a technique used when test items are considered to be a randomly generated subset of an ensemble of items with a particular distribution, typically a normal distribution with a mean difficulty and standard deviation.
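The generative process of Section 3.2 can be sketched as follows (our illustration, at a smaller scale than the paper's runs; the seed and sizes are arbitrary, and the t_min/e_min filtering step is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 50, 30                      # persons and tests (smaller than the paper's runs)
mu_theta, s_theta = 0.0, 1.0       # person-ability distribution
mu_beta, s_beta = 0.0, 1.0         # test mean-difficulty distribution
mu_K, s_K, K_min = 40, 20, 10      # items per test
sigma_T = 1.0                      # true within-ensemble SD of item difficulty
r = 1.0                            # targeting window in logits

theta = rng.normal(mu_theta, s_theta, M)
beta = rng.normal(mu_beta, s_beta, N)
K = np.maximum(np.rint(rng.normal(mu_K, s_K, N)).astype(int), K_min)
Q = rng.uniform(0.0, 1.0, M)       # per-person probability of taking a targeted test

Y = np.full((M, N), np.nan)        # fraction correct; NaN marks tests not taken
for i in range(M):
    for j in range(N):
        if abs(beta[j] - theta[i]) <= r and rng.uniform() <= Q[i]:
            b_items = rng.normal(beta[j], sigma_T, K[j])       # item difficulties
            p = 1.0 / (1.0 + np.exp(-(theta[i] - b_items)))    # Rasch success probs
            Y[i, j] = np.mean(rng.uniform(size=K[j]) <= p)     # simulated raw score
```

Each filled entry of Y is a simulated raw-score fraction; NaN entries correspond to person/test pairs that never met.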
The technique involves iteratively updating estimates of both person ability and ensemble item difficulty until they sufficiently converge.
For an assumed value of the standard deviation of item difficulty σ_A, it is possible to construct a function describing the expected success rate of an individual on an item using the Ensemble Rasch Model described in this paper. Given theoretical mean item difficulties β_T,j, j ∈ {1, 2, ..., N}, for each ensemble (in this case they are presumed to be the same as the true mean difficulties from the simulation), the first step is to find empirical individual abilities θ_E,i, i ∈ {1, 2, ..., M}, based on the raw scores in Y. This can be accomplished by using Ensemble Rasch Models for each individual on each test that they encountered and creating a new sigmoidal function for each individual that is the result of a weighted sum of the individual sigmoids shifted by β_T,j. The weighting is based on the number of items in each test taken by the individual, and the weights sum to one. This new sigmoid describes the expected total fraction of items correct out of all of the items on the tests taken by an individual as a function of ability (instead of as a function of the difference between ability and difficulty). Subsequently, an empirical ability can be found from the individual's total raw score, i.e., the values that result from a weighted summation along the rows of Y, where the weighting is as described earlier in this paragraph. Mathematically, this function can be written as

    P_E(θ) = Σ_{j=1}^{N} n_j P_j(θ) / Σ_{j=1}^{N} n_j    (9)

where P_j(θ) is the Ensemble Rasch Model ICC as a function of ability for known values of mean item difficulty β_T,j, j ∈ {1, 2, ..., N}, and a known value of σ, and n_j is the number of items in test j.

With initial empirical estimates for individual ability, theoretical values for test mean item difficulty, and raw scores, it is possible to determine empirical test mean item difficulties.
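The ability step built on (9) can be sketched as follows (our illustration; the monotonicity of the pooled sigmoid is what justifies inversion by bisection):

```python
import numpy as np

def icc(theta, beta, sigma, n_nodes=40):
    # Ensemble ICC P(theta - beta, sigma), Gauss-Hermite quadrature (Eq. 6).
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    eps = np.sqrt(2.0) * sigma * x
    return float(np.sum(w / (1.0 + np.exp(-(theta - beta + eps)))) / np.sqrt(np.pi))

def pooled_icc(theta, betas, n_items, sigma):
    """Eq. (9): item-count-weighted mixture of the per-test ensemble ICCs."""
    num = sum(n_j * icc(theta, b_j, sigma) for b_j, n_j in zip(betas, n_items))
    return num / sum(n_items)

def ability_from_raw(frac_correct, betas, n_items, sigma, lo=-10.0, hi=10.0):
    # Invert the monotone pooled sigmoid at the person's weighted raw score.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if pooled_icc(mid, betas, n_items, sigma) < frac_correct:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For a person who took two equally sized tests centered symmetrically about zero and answered half the items correctly, the recovered ability is zero, as the symmetry of (9) requires.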
This is simply done by using the raw scores in Y for each individual/test combination and finding the corresponding values for the difference between individual ability and test mean item difficulty from the standard P(η, σ) sigmoid. The resulting values η_i,j can be used to find β_E,j by averaging θ_E,i − η_i,j over i for each j, excluding the values corresponding to tests not taken by an individual. To prevent drifting, the mean of the set
of empirical test mean item difficulties is anchored to the mean of the set of theoretical test mean item difficulties.

After the empirical values for test mean item difficulties are found, it is possible to do a second iteration of finding empirical individual abilities by replacing the theoretical test mean item difficulties with the empirical ones. After this, a new empirical test mean item difficulty can be found in the same way as before. Iterations can continue until a stopping criterion is met and the solutions are thought to converge. The criterion can take many forms; a threshold value α for the relative change between iterations in the mean squared error between theoretical and empirical test mean item difficulty is a useful basis for one. Figure 3 shows this process in block diagram form. Block 1 is the function that combines raw scores and current mean item difficulty estimates into ability estimates. Block 2 is the function that combines the ability estimates and raw scores into mean item difficulty measurements that are fed back into block 1. This loop continues until the stopping criterion is met.

Figure 3: Ensemble calibration block diagram.

3.4 Results

Data simulation and ensemble calibration based on that data were executed with the parameter values given in Table 1. As shown in the table, the assumed values for the
standard deviation of test item difficulty σ_A range from 0 to 2.

Table 1: Parameter values.

    Number of individuals, M                                100
    Mean person ability, µ_θ                                0
    STD of person ability, σ_θ                              1
    Number of tests, N                                      100
    Mean test difficulty, µ_β                               0
    STD of test difficulty, σ_β                             1
    Mean number of items on a test, µ_K                     40
    STD of number of items on a test, σ_K                   20
    Minimum number of items on a test, K_min                10
    True STD of test item difficulty, σ_T                   1
    Range for targeted tests, r                             1
    Minimum tests taken by an individual, t_min             10
    Minimum times a test was taken, e_min                   10
    Relative convergence criterion, α
    Assumed values for STD of test item difficulty, σ_A     (0, 2]

Figure 4 shows the results of the ensemble calibration based on the simulated data for three different values of the assumed test item difficulty standard deviation σ_A: ɛ (a very small number close to zero), 1 (the true value for standard deviation σ_T), and 2. It is observed that when σ_T is underestimated, low-end values for person ability and test difficulty are overestimated and high-end values are underestimated. The opposite is true when σ_T is overestimated. Figure 5 shows the MSE between the theoretical and empirical values for test difficulty as a function of the assumed standard deviation σ_A for σ_T = 1. In this plot it is observed
Figure 4: Ensemble calibration results. a, c, and e) prescribed ability θ_T vs. empirical ability θ_E for σ_T = 1 and σ_A = ɛ, 1, and 2 respectively. b, d, and f) prescribed mean item difficulty β_T vs. empirical mean item difficulty β_E for σ_T = 1 and σ_A = ɛ, 1, and 2 respectively.

that the MSE is at a minimum near σ_A = σ_T, suggesting that the Ensemble Rasch Model is an appropriate model when items are random. Note that the MSE for a traditional Rasch Model is shown at σ_A = 0 and that the Ensemble Rasch Model performs better for good estimates of σ_T. However, the MSE increases significantly as the true standard deviation of item difficulty σ_T is increasingly overestimated, which underscores the importance of having a good estimate of σ_T.
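The full loop of Figure 3 can be sketched end to end as follows (our illustration; the bracketing interval, bisection, and convergence handling are simplifications, and interior raw scores 0 < Y_i,j < 1 are assumed):

```python
import numpy as np

def icc_vec(eta, sigma, n_nodes=40):
    # Vectorized ensemble ICC (Eq. 6) via Gauss-Hermite quadrature.
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    eps = np.sqrt(2.0) * sigma * x
    vals = 1.0 / (1.0 + np.exp(-(np.asarray(eta, dtype=float)[..., None] + eps)))
    return vals @ w / np.sqrt(np.pi)

def solve_monotone(f, target, lo=-10.0, hi=10.0):
    # Bisection on a monotone increasing function.
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < target else (lo, mid)
    return 0.5 * (lo + hi)

def ensemble_calibrate(Y, n_items, beta_theory, sigma, alpha=1e-6, max_iter=50):
    """Alternate block 1 (abilities from raw scores and current difficulties)
    and block 2 (difficulties from abilities and raw scores), anchoring the
    mean of the empirical difficulties to the mean of the theoretical ones."""
    M, N = Y.shape
    taken = ~np.isnan(Y)
    beta_e = np.asarray(beta_theory, dtype=float).copy()
    prev = np.inf
    for _ in range(max_iter):
        theta_e = np.empty(M)
        for i in range(M):                       # block 1: pooled sigmoid, Eq. (9)
            j = np.flatnonzero(taken[i])
            w_ij = n_items[j] / n_items[j].sum()
            theta_e[i] = solve_monotone(
                lambda t: np.sum(w_ij * icc_vec(t - beta_e[j], sigma)),
                np.sum(w_ij * Y[i, j]))
        for j in range(N):                       # block 2: difficulties per test
            i = np.flatnonzero(taken[:, j])
            etas = np.array([solve_monotone(lambda e: icc_vec(e, sigma), Y[k, j])
                             for k in i])
            beta_e[j] = np.mean(theta_e[i] - etas)
        beta_e += np.mean(beta_theory) - beta_e.mean()   # anchor against drift
        mse = np.mean((beta_e - beta_theory) ** 2)
        if abs(prev - mse) <= alpha * max(mse, 1e-12):   # relative-change stop
            break
        prev = mse
    return theta_e, beta_e
```

On noise-free raw scores generated directly from the model, the loop recovers the true abilities and mean difficulties up to numerical tolerance, which provides a quick sanity check of the procedure.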
Figure 5: Empirical MSE between β_T and β_E based on simulation.

4. LEARNING OASIS

A specific application of ensemble calibration involves the reading research platform Learning Oasis developed by MetaMetrics, Inc. (Hanlon et al. 2010). The Learning Oasis application provides a combination of assessment and instruction where students read and encounter a variety of item types. Some of these item types, such as the auto-generated semantic cloze item, present unique challenges for traditional psychometrics. The Learning Oasis application generates a new set of these items for each article for every student encounter with that article. No two students ever see the same item. Thus, there is not enough data to produce individual item statistics, because each item is used only once (unless by chance an identical item is generated at another time for another student). However, student measures are updated using a Bayesian algorithm after each article is read. The initial calibration of the mean ensemble difficulty of the articles used by Learning Oasis is provided by the Lexile® Framework for Reading (Stenner 1996). The Lexile® Framework for Reading is a
tool that provides measures of text difficulty and student reading (or writing) ability on the same scale, which can be converted to logits in the manner described earlier in this paper with values a = and b = 1.1. As students read more articles, more data are collected and empirical estimates of text complexity are computed.

Within Learning Oasis, so-called cloze items are generated by selecting words that are within a specified range of the difficulty of the article and removing them. Students are tasked with choosing the correct word out of a list of four words; one is the correct word and the other three are foils that occur with similar frequency in language around the difficulty of the text. This type of item is very similar to the cloze item technique in Taylor (1953) except for the fact that the items are automatically generated, hence the term auto-generated cloze. Because these items are randomly chosen out of the text of the article (within certain constraints), and the foil words are similarly chosen, these items can be thought of as coming from a large ensemble of potential items where item difficulty can be described by a probability distribution function. In the case of Learning Oasis, it is assumed that items come from a normal distribution with a mean difficulty equal to the difficulty of the article's text and an assumed standard deviation. This assumed standard deviation can be confirmed (or determined) empirically through ensemble calibration and is considered to be the same for every article. In summary, the tests are calibrated by estimating the parameters of the ensemble instead of the individual item parameters.

5. DISCUSSION

5.1 Potential improvements

There are several improvements that can be made to the Ensemble Rasch Model and its use in ensemble calibration that are beyond the scope of this paper.
First, it is thought that constraining the standard deviation of item difficulties to be the same for all tests is an over-simplification, and, for that matter, so is constraining the distributions of item difficulty to be normal in the first place. While the assumption of a uniform value for the standard deviation is not without merit, there are a couple of potential remedies. First, it is possible to determine values for
the standard deviation of each individual test based on theoretical values for item difficulty. If a relationship between certain characteristics of the test and the spread of item difficulties is found, then it can be exploited to provide better estimates of the distribution of item difficulties. Second, it is possible to calibrate the standard deviations of item difficulty for each test much as the mean difficulty is calibrated; however, this would be a much more computationally expensive undertaking. One simple way to do this would be to look at the person abilities and mean item difficulties at each iteration of the ensemble calibration and add a step where the standard deviation is tweaked for each individual test so that the raw scores of the persons who encountered that test match their abilities on a P(η, σ) curve for that particular test as closely as possible. Additionally, it may be possible to consider arbitrarily shaped distributions for each test. Yet another extension of the Ensemble Rasch Model involves extending the model to include multiple dimensions, as is done in Briggs and Wilson (2003) for a non-random Rasch Model, and/or generalizing it to include polytomous data, as is done in Andrich (1978).

5.2 Other applications

In addition to its use in Learning Oasis, there are other applications for the Ensemble Rasch Model. Within the realm of education, very similar applications can be developed for math items. MetaMetrics, Inc. also has a mathematical ability measure known as the Quantile® Framework that is based on the idea that there are many types of math skills, known as QTaxons, that fall along a developmental continuum (The Quantile Framework for Mathematics 2012).
A similar application to Learning Oasis may involve auto-generated math problems, where a math problem is considered to come from an ensemble of problems whose mean difficulty is the same as the difficulty of the particular QTaxon, with a standard deviation that describes the spread of difficulty of those types of problems.

Outside of education, Ensemble Rasch Models can be used where individuals encounter tasks that are inherently single-instance. For example, within the sport of baseball, batters
will face multiple pitches from many different pitchers. Pitchers will throw a set of pitches with varying degrees of difficulty. A pitcher will never throw the exact same pitch twice, and his pitches can be regarded as coming from a normal distribution of pitches with a mean difficulty and a standard deviation. Success can be defined in many different ways: hitting a home run, getting on base, not striking out, etc. Regardless, applying the Ensemble Rasch Model would provide insight into a player's propensity for a given outcome. Such a technique would be a useful tool for evaluating prospects or simply giving fans more interesting statistics. Imagine you are watching a baseball game on television and, instead of seeing a batting average on the screen, you are shown the likelihood that the batter will get a hit (or home run, etc.) in that at-bat against that particular pitcher. Additionally, interesting insight could be obtained about the ability of players throughout the history of the sport. For example, it would be possible to estimate how many home runs a legend such as Babe Ruth would be expected to hit if he were playing for the Yankees in the year 2012 instead of 1927, when he hit 60 home runs in a season. Many other sport-related applications can be conceived, including applications for sports such as football, basketball, and tennis, given adequate statistics.

REFERENCES

Andrich, D. (1978), "A rating formulation for ordered response categories," Psychometrika, 43.

Briggs, D., and Wilson, M. (2003), "An introduction to multidimensional measurement using Rasch models," Journal of Applied Measurement, 4(1).

De Boeck, P. (2008), "Random Item IRT Models," Psychometrika, 73(4).

Hanlon, S., Swartz, C., Stenner, A., Burdick, H., and Burdick, D. (2010), "Oasis Literacy Research Platform."
Rasch, G. (1960), Probabilistic Models for Some Intelligence and Attainment Tests, Copenhagen: Danish Institute for Educational Research.

Stenner, A. (1996), "Measuring Reading Comprehension with the Lexile Framework," in Fourth North American Conference on Adolescent/Adult Literacy, Washington, D.C., February 1.

The Quantile Framework for Mathematics (2012).
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationUnderstanding and Using Variables
Algebra is a powerful tool for understanding the world. You can represent ideas and relationships using symbols, tables and graphs. In this section you will learn about Understanding and Using Variables
More informationDevelopment and Calibration of an Item Response Model. that Incorporates Response Time
Development and Calibration of an Item Response Model that Incorporates Response Time Tianyou Wang and Bradley A. Hanson ACT, Inc. Send correspondence to: Tianyou Wang ACT, Inc P.O. Box 168 Iowa City,
More informationItem Response Theory (IRT) Analysis of Item Sets
University of Connecticut DigitalCommons@UConn NERA Conference Proceedings 2011 Northeastern Educational Research Association (NERA) Annual Conference Fall 10-21-2011 Item Response Theory (IRT) Analysis
More informationSuperiorized Inversion of the Radon Transform
Superiorized Inversion of the Radon Transform Gabor T. Herman Graduate Center, City University of New York March 28, 2017 The Radon Transform in 2D For a function f of two real variables, a real number
More informationWhats beyond Concerto: An introduction to the R package catr. Session 4: Overview of polytomous IRT models
Whats beyond Concerto: An introduction to the R package catr Session 4: Overview of polytomous IRT models The Psychometrics Centre, Cambridge, June 10th, 2014 2 Outline: 1. Introduction 2. General notations
More informationClassical Test Theory. Basics of Classical Test Theory. Cal State Northridge Psy 320 Andrew Ainsworth, PhD
Cal State Northridge Psy 30 Andrew Ainsworth, PhD Basics of Classical Test Theory Theory and Assumptions Types of Reliability Example Classical Test Theory Classical Test Theory (CTT) often called the
More informationStatistical and psychometric methods for measurement: Scale development and validation
Statistical and psychometric methods for measurement: Scale development and validation Andrew Ho, Harvard Graduate School of Education The World Bank, Psychometrics Mini Course Washington, DC. June 11,
More informationCenter for Advanced Studies in Measurement and Assessment. CASMA Research Report
Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 41 A Comparative Study of Item Response Theory Item Calibration Methods for the Two Parameter Logistic Model Kyung
More informationHow to Measure the Objectivity of a Test
How to Measure the Objectivity of a Test Mark H. Moulton, Ph.D. Director of Research, Evaluation, and Psychometric Development Educational Data Systems Specific Objectivity (Ben Wright, Georg Rasch) Rasch
More informationThe application and empirical comparison of item. parameters of Classical Test Theory and Partial Credit. Model of Rasch in performance assessments
The application and empirical comparison of item parameters of Classical Test Theory and Partial Credit Model of Rasch in performance assessments by Paul Moloantoa Mokilane Student no: 31388248 Dissertation
More informationQ-Learning in Continuous State Action Spaces
Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental
More informationQuiz For use after Section 4.2
Name Date Quiz For use after Section.2 Write the word sentence as an inequality. 1. A number b subtracted from 9.8 is greater than. 2. The quotient of a number y and 3.6 is less than 6.5. Tell whether
More informationAbility Metric Transformations
Ability Metric Transformations Involved in Vertical Equating Under Item Response Theory Frank B. Baker University of Wisconsin Madison The metric transformations of the ability scales involved in three
More informationCOMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017
COMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University SEQUENTIAL DATA So far, when thinking
More informationWhat Rasch did: the mathematical underpinnings of the Rasch model. Alex McKee, PhD. 9th Annual UK Rasch User Group Meeting, 20/03/2015
What Rasch did: the mathematical underpinnings of the Rasch model. Alex McKee, PhD. 9th Annual UK Rasch User Group Meeting, 20/03/2015 Our course Initial conceptualisation Separation of parameters Specific
More informationOverview. Multidimensional Item Response Theory. Lecture #12 ICPSR Item Response Theory Workshop. Basics of MIRT Assumptions Models Applications
Multidimensional Item Response Theory Lecture #12 ICPSR Item Response Theory Workshop Lecture #12: 1of 33 Overview Basics of MIRT Assumptions Models Applications Guidance about estimating MIRT Lecture
More informationNeural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Neural Networks CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Perceptrons x 0 = 1 x 1 x 2 z = h w T x Output: z x D A perceptron
More informationHomework 2: MDPs and Search
Graduate Artificial Intelligence 15-780 Homework 2: MDPs and Search Out on February 15 Due on February 29 Problem 1: MDPs [Felipe, 20pts] Figure 1: MDP for Problem 1. States are represented by circles
More informationCenter for Advanced Studies in Measurement and Assessment. CASMA Research Report
Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 24 in Relation to Measurement Error for Mixed Format Tests Jae-Chun Ban Won-Chan Lee February 2007 The authors are
More informationSwarthmore Honors Exam 2012: Statistics
Swarthmore Honors Exam 2012: Statistics 1 Swarthmore Honors Exam 2012: Statistics John W. Emerson, Yale University NAME: Instructions: This is a closed-book three-hour exam having six questions. You may
More informationAditya Bhaskara CS 5968/6968, Lecture 1: Introduction and Review 12 January 2016
Lecture 1: Introduction and Review We begin with a short introduction to the course, and logistics. We then survey some basics about approximation algorithms and probability. We also introduce some of
More informationRegression with Numerical Optimization. Logistic
CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204
More informationLecture 4. 1 Learning Non-Linear Classifiers. 2 The Kernel Trick. CS-621 Theory Gems September 27, 2012
CS-62 Theory Gems September 27, 22 Lecture 4 Lecturer: Aleksander Mądry Scribes: Alhussein Fawzi Learning Non-Linear Classifiers In the previous lectures, we have focused on finding linear classifiers,
More informationConditional maximum likelihood estimation in polytomous Rasch models using SAS
Conditional maximum likelihood estimation in polytomous Rasch models using SAS Karl Bang Christensen kach@sund.ku.dk Department of Biostatistics, University of Copenhagen November 29, 2012 Abstract IRT
More informationIntroduction to Basic Statistics Version 2
Introduction to Basic Statistics Version 2 Pat Hammett, Ph.D. University of Michigan 2014 Instructor Comments: This document contains a brief overview of basic statistics and core terminology/concepts
More informationSupporting struggling readers in KS3
Supporting struggling readers in KS3 Thomas Martell Sedgefield Friday 9 th February 2018 My first day at secondary school was one of trepidation and excitement. A new haircut and a hand-me-down uniform
More informationArticle from. Predictive Analytics and Futurism. July 2016 Issue 13
Article from Predictive Analytics and Futurism July 2016 Issue 13 Regression and Classification: A Deeper Look By Jeff Heaton Classification and regression are the two most common forms of models fitted
More informationThe Difficulty of Test Items That Measure More Than One Ability
The Difficulty of Test Items That Measure More Than One Ability Mark D. Reckase The American College Testing Program Many test items require more than one ability to obtain a correct response. This article
More informationCHAPTER 3. THE IMPERFECT CUMULATIVE SCALE
CHAPTER 3. THE IMPERFECT CUMULATIVE SCALE 3.1 Model Violations If a set of items does not form a perfect Guttman scale but contains a few wrong responses, we do not necessarily need to discard it. A wrong
More informationExperiment 2 Random Error and Basic Statistics
PHY191 Experiment 2: Random Error and Basic Statistics 7/12/2011 Page 1 Experiment 2 Random Error and Basic Statistics Homework 2: turn in the second week of the experiment. This is a difficult homework
More informationAnders Skrondal. Norwegian Institute of Public Health London School of Hygiene and Tropical Medicine. Based on joint work with Sophia Rabe-Hesketh
Constructing Latent Variable Models using Composite Links Anders Skrondal Norwegian Institute of Public Health London School of Hygiene and Tropical Medicine Based on joint work with Sophia Rabe-Hesketh
More informationSignificant Figures: A Brief Tutorial
Significant Figures: A Brief Tutorial 2013-2014 Mr. Berkin *Please note that some of the information contained within this guide has been reproduced for non-commercial, educational purposes under the Fair
More informationDistributed Optimization. Song Chong EE, KAIST
Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links
More informationAppendix A. Review of Basic Mathematical Operations. 22Introduction
Appendix A Review of Basic Mathematical Operations I never did very well in math I could never seem to persuade the teacher that I hadn t meant my answers literally. Introduction Calvin Trillin Many of
More informationSupplementary Technical Details and Results
Supplementary Technical Details and Results April 6, 2016 1 Introduction This document provides additional details to augment the paper Efficient Calibration Techniques for Large-scale Traffic Simulators.
More informationStatistical Inference, Populations and Samples
Chapter 3 Statistical Inference, Populations and Samples Contents 3.1 Introduction................................... 2 3.2 What is statistical inference?.......................... 2 3.2.1 Examples of
More informationObserved-Score "Equatings"
Comparison of IRT True-Score and Equipercentile Observed-Score "Equatings" Frederic M. Lord and Marilyn S. Wingersky Educational Testing Service Two methods of equating tests are compared, one using true
More information1. THE IDEA OF MEASUREMENT
1. THE IDEA OF MEASUREMENT No discussion of scientific method is complete without an argument for the importance of fundamental measurement - measurement of the kind characterizing length and weight. Yet,
More informationChapter 4. Inequalities
Chapter 4 Inequalities Vannevar Bush, Internet Pioneer 4.1 Inequalities 4. Absolute Value 4.3 Graphing Inequalities with Two Variables Chapter Review Chapter Test 64 Section 4.1 Inequalities Unlike equations,
More informationFitting Multidimensional Latent Variable Models using an Efficient Laplace Approximation
Fitting Multidimensional Latent Variable Models using an Efficient Laplace Approximation Dimitris Rizopoulos Department of Biostatistics, Erasmus University Medical Center, the Netherlands d.rizopoulos@erasmusmc.nl
More informationWalkthrough for Illustrations. Illustration 1
Tay, L., Meade, A. W., & Cao, M. (in press). An overview and practical guide to IRT measurement equivalence analysis. Organizational Research Methods. doi: 10.1177/1094428114553062 Walkthrough for Illustrations
More informationCompound and Complex Sentences
Name Satchel Paige EVELOP PROOFRE THE ONEPT ompound and omplex Sentences simple sentence expresses a complete thought. It has a subject and a predicate. The Negro League formed in 1920. compound sentence
More informationProbability and Information Theory. Sargur N. Srihari
Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal
More informationWhat is an Ordinal Latent Trait Model?
What is an Ordinal Latent Trait Model? Gerhard Tutz Ludwig-Maximilians-Universität München Akademiestraße 1, 80799 München February 19, 2019 arxiv:1902.06303v1 [stat.me] 17 Feb 2019 Abstract Although various
More informationStatistical Methods in Particle Physics
Statistical Methods in Particle Physics Lecture 11 January 7, 2013 Silvia Masciocchi, GSI Darmstadt s.masciocchi@gsi.de Winter Semester 2012 / 13 Outline How to communicate the statistical uncertainty
More informationThe Math Learning Center PO Box 12929, Salem, Oregon Math Learning Center
Resource Overview Quantile Measure: Skill or Concept: 150Q Find the value of an unknown in a number sentence. (QT A 549) Excerpted from: The Math Learning Center PO Box 12929, Salem, Oregon 97309 0929
More informationLINKING IN DEVELOPMENTAL SCALES. Michelle M. Langer. Chapel Hill 2006
LINKING IN DEVELOPMENTAL SCALES Michelle M. Langer A thesis submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Master
More informationApplied Bayesian Statistics STAT 388/488
STAT 388/488 Dr. Earvin Balderama Department of Mathematics & Statistics Loyola University Chicago August 29, 207 Course Info STAT 388/488 http://math.luc.edu/~ebalderama/bayes 2 A motivating example (See
More informationMachine Learning
Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University April 5, 2011 Today: Latent Dirichlet Allocation topic models Social network analysis based on latent probabilistic
More information1. Introductory Examples
1. Introductory Examples We introduce the concept of the deterministic and stochastic simulation methods. Two problems are provided to explain the methods: the percolation problem, providing an example
More informationBayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington
Bayesian Classifiers and Probability Estimation Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington 1 Data Space Suppose that we have a classification problem The
More informationMath 710 Homework 1. Austin Mohr September 2, 2010
Math 710 Homework 1 Austin Mohr September 2, 2010 1 For the following random experiments, describe the sample space Ω For each experiment, describe also two subsets (events) that might be of interest,
More informationExperiment 2 Random Error and Basic Statistics
PHY9 Experiment 2: Random Error and Basic Statistics 8/5/2006 Page Experiment 2 Random Error and Basic Statistics Homework 2: Turn in at start of experiment. Readings: Taylor chapter 4: introduction, sections
More informationLinear Model Selection and Regularization
Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In
More informationThird-Order Tensor Decompositions and Their Application in Quantum Chemistry
Third-Order Tensor Decompositions and Their Application in Quantum Chemistry Tyler Ueltschi University of Puget SoundTacoma, Washington, USA tueltschi@pugetsound.edu April 14, 2014 1 Introduction A tensor
More informationGeneralized Linear Models for Non-Normal Data
Generalized Linear Models for Non-Normal Data Today s Class: 3 parts of a generalized model Models for binary outcomes Complications for generalized multivariate or multilevel models SPLH 861: Lecture
More informationLooking Ahead to Chapter 10
Looking Ahead to Chapter Focus In Chapter, you will learn about polynomials, including how to add, subtract, multiply, and divide polynomials. You will also learn about polynomial and rational functions.
More information1 The Basic Counting Principles
1 The Basic Counting Principles The Multiplication Rule If an operation consists of k steps and the first step can be performed in n 1 ways, the second step can be performed in n ways [regardless of how
More informationLikelihood and Fairness in Multidimensional Item Response Theory
Likelihood and Fairness in Multidimensional Item Response Theory or What I Thought About On My Holidays Giles Hooker and Matthew Finkelman Cornell University, February 27, 2008 Item Response Theory Educational
More informationA multivariate multilevel model for the analysis of TIMMS & PIRLS data
A multivariate multilevel model for the analysis of TIMMS & PIRLS data European Congress of Methodology July 23-25, 2014 - Utrecht Leonardo Grilli 1, Fulvia Pennoni 2, Carla Rampichini 1, Isabella Romeo
More information1 Probabilities. 1.1 Basics 1 PROBABILITIES
1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability
More informationStudent Sheet: Self-Assessment
Student s Name Date Class Student Sheet: Self-Assessment Directions: Use the space provided to prepare a KWL chart. In the first column, write things you already know about energy, forces, and motion.
More informationProducing Data/Data Collection
Producing Data/Data Collection Without serious care/thought here, all is lost... no amount of clever postprocessing of useless data will make it informative. GIGO Chapter 3 of MMD&S is an elementary discussion
More informationProducing Data/Data Collection
Producing Data/Data Collection Without serious care/thought here, all is lost... no amount of clever postprocessing of useless data will make it informative. GIGO Chapter 3 of MMD&S is an elementary discussion
More informationSESSION 5 Descriptive Statistics
SESSION 5 Descriptive Statistics Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple
More informationEquating Tests Under The Nominal Response Model Frank B. Baker
Equating Tests Under The Nominal Response Model Frank B. Baker University of Wisconsin Under item response theory, test equating involves finding the coefficients of a linear transformation of the metric
More informationApplication of Item Response Theory Models for Intensive Longitudinal Data
Application of Item Response Theory Models for Intensive Longitudinal Data Don Hedeker, Robin Mermelstein, & Brian Flay University of Illinois at Chicago hedeker@uic.edu Models for Intensive Longitudinal
More informationThe Rasch Poisson Counts Model for Incomplete Data: An Application of the EM Algorithm
The Rasch Poisson Counts Model for Incomplete Data: An Application of the EM Algorithm Margo G. H. Jansen University of Groningen Rasch s Poisson counts model is a latent trait model for the situation
More informationword2vec Parameter Learning Explained
word2vec Parameter Learning Explained Xin Rong ronxin@umich.edu Abstract The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent two years. The vector
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression
More information1 Probabilities. 1.1 Basics 1 PROBABILITIES
1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability
More informationGRADE 6 Projections Masters
TEKSING TOWARD STAAR MATHEMATICS GRADE 6 Projections Masters Six Weeks 1 Lesson 1 STAAR Category 1 Grade 6 Mathematics TEKS 6.2A/6.2B Understanding Rational Numbers A group of items or numbers is called
More informationLearning MN Parameters with Alternative Objective Functions. Sargur Srihari
Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient
More informationEfficient Likelihood-Free Inference
Efficient Likelihood-Free Inference Michael Gutmann http://homepages.inf.ed.ac.uk/mgutmann Institute for Adaptive and Neural Computation School of Informatics, University of Edinburgh 8th November 2017
More informationPREDICTING THE DISTRIBUTION OF A GOODNESS-OF-FIT STATISTIC APPROPRIATE FOR USE WITH PERFORMANCE-BASED ASSESSMENTS. Mary A. Hansen
PREDICTING THE DISTRIBUTION OF A GOODNESS-OF-FIT STATISTIC APPROPRIATE FOR USE WITH PERFORMANCE-BASED ASSESSMENTS by Mary A. Hansen B.S., Mathematics and Computer Science, California University of PA,
More information