Paul Barrett

Paul Barrett email: p.barrett@liv.ac.uk http://www.liv.ac.uk/~pbarrett/paulhome.

1 Paul Barrett. Affiliations: The State Hospital, Carstairs; Dept. of Clinical Psychology, Univ. of Liverpool. 20th November, 1998 (Addendum on 12/10/99)

2 What is Rasch Scaling.1: A mathematical procedure that attempts to scale responses to individual items, such that the probability of answering an item in a certain way (whether YES/NO, or multiple-choice Likert format) is computed solely from the amount of a latent variable that a person is measured to possess and from the difficulty measure for the item.

3 What is Rasch Scaling.2: Scaling can be defined as the encoding of empirical observations using numbers to represent attribute/variable magnitudes, given a set of rules or axioms that the proposed measurement must subsequently satisfy.

4 What is Rasch Scaling.3: A latent variable can be defined as the particular, inferred construct that we are trying to measure with a set of items. This may be an ability, an attitude, or a personality variable such as Anxiety. A factor from a factor analysis is what we would also generally refer to as a latent variable or attribute.

5 What is Rasch Scaling.4: The difficulty measure for an item can be defined in two ways. Classically, it is the ratio of the number of respondents scoring an item in the keyed or correct direction to the total number of respondents. In Rasch scaling, it is an index that expresses the position of the item on the latent variable scale at which 50% of respondents would respond in the keyed or correct direction.
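As a minimal sketch of the classical definition only, the proportion keyed for each item can be computed directly from a 0/1 response matrix. The function name and the tiny data set are hypothetical and are not taken from the talk.

```python
def classical_difficulty(responses):
    """Proportion of respondents scoring each item in the keyed direction.

    `responses` holds one row per respondent; each row is a list of 0/1 item scores.
    """
    n_persons = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_persons for i in range(n_items)]


# Hypothetical data: 4 respondents, 3 items.
data = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]
print(classical_difficulty(data))  # [1.0, 0.5, 0.25]
```

Note that this classical index depends on the sample that answered the items, whereas the Rasch index is a location on the latent variable scale.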

6 What is Rasch Scaling.5: Critical Point. Rasch scaling uses the same scale of measurement for expressing both item difficulty and person ability. That is, the same unit of measurement is used to express difficulty and ability.

7 What is Rasch Scaling: An Item Characteristic Curve (ICC). [Figure: ICC. Axes: probability of a correct response to this item; latent variable measure, Theta (in z-scores).] a = discrimination parameter: the value of the slope of the line at the midpoint of the curve (inflexion point). b = item difficulty parameter: the location of the inflexion point of the curve on the Theta axis.

8 What is Rasch Scaling: The same ICC, which now includes the guessing parameter, c = 0.2. [Figure: ICC. Axes: probability of a correct response to this item; latent variable measure (in z-scores).] c = guessing probability: the lower asymptote of the ICC curve. a = discrimination parameter: the value of the slope of the line at the midpoint of the curve (inflexion point). b = item difficulty parameter: the location of the inflexion point of the curve on the Theta axis.

9 What is Rasch Scaling.8:

The one-parameter Rasch Model:

$$p_i(\theta) = \frac{1}{1 + e^{-D(\theta - b_i)}}$$

The two-parameter IRT Model:

$$p_i(\theta) = \frac{1}{1 + e^{-D a_i (\theta - b_i)}}$$

The three-parameter IRT Model:

$$p_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-D a_i (\theta - b_i)}}$$

where
θ = the measure (score) of a person on the latent trait
b_i = the difficulty of item i
a_i = the discrimination of item i
c_i = the guessing probability for item i
D = a constant used for "normalisation"

Paul Barrett: BPS Millenium Conference: Beyond Psychometrics November 1998
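The three models on this slide can be sketched as a single function, since the one- and two-parameter forms are special cases of the three-parameter form. This is an illustration only: the function name is hypothetical, and D = 1.7 is assumed because that is the constant that appears in the fitted-curve formula on slide 19.

```python
import math

D = 1.7  # assumed scaling constant, taken from the fitted-curve formula on slide 19


def icc(theta, b, a=1.0, c=0.0, d=D):
    """Probability of a keyed response under the three-parameter logistic model.

    With a = 1 and c = 0 this reduces to the one-parameter (Rasch) form;
    with c = 0 only, it is the two-parameter form.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


# At theta == b the logistic term is 0.5, so the probability is c + (1 - c) / 2.
print(icc(0.0, 0.0))         # 0.5 for the Rasch form
print(icc(0.0, 0.0, c=0.2))  # approximately 0.6 with a guessing parameter of 0.2
```

Setting theta equal to b makes the role of b visible: it is the point on the Theta axis at which the curve is halfway between its lower asymptote c and 1.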

10 What is Rasch Scaling.9: So, what we are doing is attempting to model the responses to items in a test, given the amount of the latent variable inferred to be present within every individual who provided responses, and the difficulty of each item. But, given that we do not know the individuals' latent variable scores or the item difficulties, these have to be estimated jointly. The solution is iterative, requiring a computer to implement the estimation process.
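To make "iterative joint estimation" concrete, here is a minimal sketch that alternates damped Newton-Raphson updates of person and item parameters. It is one textbook-style scheme, not the procedure of any particular software package. It assumes D = 1 (logit units), that no person or item has an all-0 or all-1 score, and a hypothetical toy data set.

```python
import math


def rasch_p(theta, b):
    """Rasch probability of a keyed response in logit units (D = 1 assumed)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def clamp(step, limit=1.0):
    """Damp a Newton step so early iterations cannot overshoot."""
    return max(-limit, min(limit, step))


def joint_estimate(responses, n_iter=100):
    """Alternate person and item updates until the iteration budget is spent."""
    n_persons, n_items = len(responses), len(responses[0])
    raw = [sum(row) for row in responses]
    item_totals = [sum(row[i] for row in responses) for i in range(n_items)]
    theta = [0.0] * n_persons
    b = [0.0] * n_items
    for _ in range(n_iter):
        for j in range(n_persons):
            probs = [rasch_p(theta[j], b[i]) for i in range(n_items)]
            info = sum(q * (1.0 - q) for q in probs)
            theta[j] += clamp((raw[j] - sum(probs)) / info)
        for i in range(n_items):
            probs = [rasch_p(theta[j], b[i]) for j in range(n_persons)]
            info = sum(q * (1.0 - q) for q in probs)
            b[i] -= clamp((item_totals[i] - sum(probs)) / info)
        # Fix the origin of the shared scale at mean item difficulty = 0.
        shift = sum(b) / n_items
        b = [x - shift for x in b]
        theta = [t - shift for t in theta]
    return theta, b


# Hypothetical responses: 6 persons, 4 items, no extreme scores.
toy = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]
theta_hat, b_hat = joint_estimate(toy)
print([round(t, 2) for t in theta_hat])
print([round(x, 2) for x in b_hat])
```

Each update moves a parameter until the observed score (a person's raw score or an item's total) matches the score the model expects, which is why persons with the same raw score receive the same estimate.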

11 What is Rasch Scaling.10: How do we create a test score (measure)? By summing the probabilities of keyed or correct responses for each item in the test, using our model parameters of item difficulty and person ability.

$$X_j = \sum_{i=1}^{K} P_i(\theta_j)$$

where X_j = the test score for person j with ability θ_j, and K = the number of items. And:

$$P_i(\theta_j) = \frac{1}{1 + e^{-D(\theta_j - b_i)}}$$

is the probability of a keyed or correct response to item i.
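A short sketch of this summation, assuming the one-parameter model with D = 1.7 and three hypothetical item difficulties (none of these values come from the talk):

```python
import math


def rasch_p(theta, b, d=1.7):
    """One-parameter probability of a keyed response (d = 1.7 is an assumption)."""
    return 1.0 / (1.0 + math.exp(-d * (theta - b)))


def expected_score(theta, difficulties, d=1.7):
    """Model test score: the sum of keyed-response probabilities over the items."""
    return sum(rasch_p(theta, b, d) for b in difficulties)


difficulties = [-1.0, 0.0, 1.0]  # hypothetical item locations
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(expected_score(theta, difficulties), 3))
```

For items placed symmetrically around a person's ability, as at theta = 0 here, the expected score is half the number of items (approximately 1.5).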

12 Why should we prefer Rasch over CTT?.1: Classical Test Theory (CTT). The equation x = t + e provides the essence of the foundational proposition of this theory. x = the observed test score; t = a hypothetical error-free true-score; e = the random error associated with a true score. Further, items are assumed to be sampled from universes or domains. Estimation of reliability and other parameters may be made using the algebra of linear sums.
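As an illustration of reliability estimated through "the algebra of linear sums", the sketch below computes coefficient alpha from item variances and the variance of the summed score. Alpha is my choice of example and is not named on this slide; the data are hypothetical.

```python
from statistics import pvariance


def coefficient_alpha(responses):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = len(responses[0])
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)


# Hypothetical data: 4 respondents, 3 dichotomous items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(coefficient_alpha(scores))  # approximately 0.75
```

The whole calculation uses only variances of items and of their sum, which is the sense in which CTT works with observed raw scores rather than with a model of each response.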

13 Why should we prefer Rasch over CTT?.2: A Probabilistic form of Additive Conjoint Measurement. Conjoint Measurement.1: The function that describes the concatenation relation between two variables and a third can be deduced axiomatically from the measurements made of the outcome (the third variable) produced by combining the values of the two variables. In our case, the items and the amount of latent variable are combined to produce a third variable (the test score).

14 Why should we prefer Rasch over CTT?.3: Conjoint Measurement.2: It requires that the two variables in the concatenation operation are non-interactive (i.e. values on each variable can be manipulated independently of each other). It enables quantitative structure to be detected via ordinal relations upon a variable. As Cliff (1992) has written, a certain kind of mild-looking ordinal consistency among three or more variables is necessary and sufficient to define equal-interval scales.

15 Why should we prefer Rasch over CTT?.4: Why should this matter? Because measures created using the Rasch measurement model also satisfy the constraints of conjoint measurement. This means that creating tests using the Rasch measurement model gives you equal-interval measurement AND additivity of units. Further, Rasch measurement also gives you unidimensional measurement, given the measurement axioms are met (i.e. the model fits the data).
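One way to see where the additivity comes from, sketched under the assumption that the one-parameter model with constant D holds exactly: taking log-odds of the response probability gives a simple difference of person and item parameters,

$$\ln\frac{p_i(\theta_j)}{1 - p_i(\theta_j)} = D(\theta_j - b_i)$$

so that for two persons j and k answering the same item i,

$$\ln\frac{p_i(\theta_j)}{1 - p_i(\theta_j)} - \ln\frac{p_i(\theta_k)}{1 - p_i(\theta_k)} = D(\theta_j - \theta_k)$$

The item difficulty cancels, so the comparison of two persons does not depend on which item is used. This is the non-interactive structure that conjoint measurement requires, and it holds only to the extent that the model fits the data.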

16 Why should we prefer Rasch over CTT?.5: Resume of features associated with the model: equal-interval, additive units of measurement; an explicit ordering of items as a cumulative response scale; sample-free calibration of item and person parameters; computation of both item and person reliability; computation of the location-sensitive standard error-of-measurement over the range of the test measures.
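The "location-sensitive" standard error can be sketched with the standard result that, for the one-parameter logistic model, test information at a given ability is D squared times the sum of p(1 - p) over items, and the standard error is the reciprocal square root of that information. D = 1.7 and the item difficulties below are assumptions for illustration.

```python
import math


def rasch_p(theta, b, d=1.7):
    """One-parameter probability of a keyed response (d = 1.7 is an assumption)."""
    return 1.0 / (1.0 + math.exp(-d * (theta - b)))


def standard_error(theta, difficulties, d=1.7):
    """Standard error of measurement at `theta`: 1 / sqrt(test information)."""
    info = sum(d * d * p * (1.0 - p)
               for p in (rasch_p(theta, b, d) for b in difficulties))
    return 1.0 / math.sqrt(info)


difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical item locations
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(standard_error(theta, difficulties), 3))
```

The error is smallest where the items are located and grows towards the extremes, unlike the single standard error of measurement that CTT applies across the whole score range.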

17 So far so good. But this is all theory. What happens in practice?

18 Data: EPQ-N (Neuroticism Scale), UK reference sample. Number of items = 23. Number of respondents = 4140 (mixed-gender adults). Scale alpha = [value missing in source].

First, I take a look at a single item characteristic curve prior to fitting the Rasch model. I am predicting the probability of respondents keying this item in the scored direction, for each possible scale score for the test (0-23). For convenience, I convert the scale scores into standardized z-score values prior to fitting a scaled logistic Rasch function to the item.

19 [Figure: Fitting EPQ-N item #3 ("Does your mood often go up or down?"). Model is: Probability = 1/(1 + e^(-1.7*( )*(x-( )))), with the two fitted parameter values missing in the source. Least-squares fit (% variance accounted for): value missing in the source. Axes: probability of correct response; z-score transformed raw score level.]
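A least-squares fit of this kind can be sketched with a plain grid search over the two free parameters of the scaled logistic. This is not the fitting routine used for the slide (which is not stated), and the "observed" proportions below are synthetic values generated from assumed parameters a = 1.2 and b = 0.3, purely so the example has something to recover.

```python
import math


def logistic(x, a, b, d=1.7):
    """Scaled logistic of the form 1 / (1 + e^(-d * a * (x - b)))."""
    return 1.0 / (1.0 + math.exp(-d * a * (x - b)))


def fit_icc(xs, ps):
    """Grid-search least squares over a in (0, 3] and b in [-3, 3], step 0.05."""
    best = None
    for ai in range(1, 61):
        a = ai / 20
        for bi in range(-60, 61):
            b = bi / 20
            sse = sum((p - logistic(x, a, b)) ** 2 for x, p in zip(xs, ps))
            if best is None or sse < best[0]:
                best = (sse, a, b)
    sse, a, b = best
    mean_p = sum(ps) / len(ps)
    ss_total = sum((p - mean_p) ** 2 for p in ps)
    return a, b, 1.0 - sse / ss_total  # last value: proportion of variance accounted for


# Synthetic proportions keyed at nine z-score levels (hypothetical, not EPQ data).
z_levels = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
observed = [logistic(z, 1.2, 0.3) for z in z_levels]
a_hat, b_hat, r_squared = fit_icc(z_levels, observed)
print(a_hat, b_hat, round(r_squared, 4))
```

With real data the observed proportions would be the share of respondents at each scale-score level who keyed the item, and the variance accounted for would fall below 1.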

20 I then fit the Rasch model to the scale of items (using the Andrich et al. RUMM software package). The data fail to fit the Rasch model (using the chi-square test of model fit) at P < (actually p ~2.5* ). Apart from 2 items, none fit the model (using standardised chi-square residual tests). However, I note that the Rasch person measures correlate at 0.99 with the conventional raw score.

21 [Figure: Scatterplot of Rasch measures vs raw EPQ-N scale scores (r = 0.99). UK reference sample data, N = 4140, mixed-gender sample. Axes: raw scale scores; Rasch measures.]

22 Next, I find that my item-difficulties are now quite different to those computed individually, as shown on the previous slide. The plot of Rasch predicted vs actual proportions of respondents scoring an item in the keyed direction, at each Raw Scale score (or Rasch measure) exemplifies this discrepancy for item #3.

23 [Figure: Rasch expected cumulative probabilities vs observed probabilities. Axes: probability of a keyed response; Rasch measures (which correlate 0.99 with raw scores). Series: observed data; predicted data.]

24 It would appear that by concentrating solely upon adjusting item difficulty location parameters, and latent trait person parameters, whilst minimising the discrepancy between the predicted raw test scores and actual raw test scores, the Rasch modelling procedure has indirectly induced considerable item misfit whilst attempting to remain within the constraints required by the axiomatic measurement properties. This is somewhat unexpected and more than a cause for serious concern - especially as the model (and all items) fit when N=200 respondents!

25 This is obviously a problem with the chi-square test being too sensitive to discrepancies between the observed and model-generated proportions of respondents (at each scale score/Rasch measure) getting the item correct. Which leaves us with the problem of just how to assess fit of items to the model. There are solutions - but somewhat heuristic, I'm afraid.

26 The next exploration looks directly at a major purported benefit of the Rasch model - the creation of equal-interval, additive units of measurement for the latent variable. Here I comprehensively extended an example briefly presented by Fisher (1992) - where we use a bad ruler (unequal units over a range of measurement) to make ordinal measurement of true equal-interval unit lengths (in cm units). This tests the capacity of the Rasch model to uncover the true equal-interval scale that underlies the raw score measurement.

27 Here, I present 40 objects for measurement to my bad ruler, which consists of 16 unequal divisions of length. The objects are actually cm units on a real ruler, expressed in terms of my bad-ruler units. Each measurement is in the form of a dichotomy: a 1 is assigned to a bad-ruler unit if my cm measure extends beyond this unit. Where my cm measure is smaller than the remaining bad-ruler units, I assign a 0 to these units. For example:

28 [Figure: the real ruler and the bad ruler, with the 0/1 records generated by 1 cm and 2 cm measures on the good ruler.] In this way, we build up 40 records, which are like responses to items on a test.
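The record-building step can be sketched as follows. The 16 tick positions are hypothetical, since the actual bad-ruler divisions are not given in the text, and no jitter is added here.

```python
# Hypothetical positions (in cm) of the 16 unequally spaced bad-ruler ticks.
BAD_RULER_TICKS_CM = [0.5, 2.5, 3.5, 7.5, 9.5, 10.5, 15.5, 17.5,
                      18.5, 24.5, 26.5, 27.5, 31.5, 34.5, 36.5, 39.5]


def record(length_cm, ticks=BAD_RULER_TICKS_CM):
    """Dichotomous record: 1 for each tick the object extends beyond, else 0."""
    return [1 if length_cm > tick else 0 for tick in ticks]


# 40 "persons" (objects of 1..40 cm) by 16 "items" (bad-ruler units).
records = [record(cm) for cm in range(1, 41)]
raw_scores = [sum(r) for r in records]

print(record(1))
print(record(2))
print(raw_scores)
```

Without jitter every record is a perfect cumulative pattern (a run of 1s followed by 0s), which is deterministic rather than probabilistic data.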

29 Fitting these slightly jittered data (for the Rasch model is probabilistic), we have 40 persons and 16 items to be provided with parameters. The rather simple-minded test here is whether the Rasch model will recover the equal-interval cm scale from the ordinal measures made by me. Fisher claimed it would - and my simplistic reasoning with regard to conjoint measurement would seem to suggest it should.

30 The model fits almost perfectly (chi-square probability ~ 0.99). All items fit the model (via chi-square). The actual raw scores for each person correlate with the expected scores computed from the Rasch modelling. This is no longer any surprise, because the model-fit procedure is minimising this discrepancy. The Rasch item location/difficulty parameters correlate with the bad-ruler units. Now this IS interesting.

31 [Figure: Rasch "difficulty" parameters and ordinal units against actual cm measurement. Axes: a scale for Rasch measures and the raw score units; measurement in actual cm. Series: item difficulty; raw score units.]

32 The Rasch estimates are mirroring my bad-ruler units. If we did not know that my 16 units were (in reality) unequally spaced, we would probably treat them as equal-interval, and plot them accordingly, which means that the Rasch difficulty/location parameters would also now be equally spaced.

33 [Figure: Rasch difficulty parameters vs the ordinal rank units. Axes: the Rasch difficulty measure for each item "unit"; the ordinal "bad-ruler" units.]

34 A final test, to confirm my suspicion that the Rasch model is NOT able to address the issue of a fundamental unit of measurement! Here, I map my bad-ruler units onto log_e(cm), then use the units to make measurement as before. The graph on the next slide shows the extent to which my bad ruler is now making curvilinear measurement of a set of extensive, equal-interval cm-unit objects.
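A sketch of what such a log-mapped ruler does to the raw scores. The exact mapping used in the talk is not given, so the version here, 16 ticks equally spaced in log_e(cm) between 1 cm and 40 cm, is an assumption.

```python
import math


def log_spaced_ticks(n_ticks=16, max_cm=40.0):
    """Tick positions in cm that are equally spaced on a log_e(cm) scale (assumed mapping)."""
    step = math.log(max_cm) / (n_ticks + 1)
    return [math.exp(k * step) for k in range(1, n_ticks + 1)]


def raw_score(length_cm, ticks):
    """Number of ticks the object extends beyond."""
    return sum(1 for tick in ticks if length_cm > tick)


ticks = log_spaced_ticks()
scores = [raw_score(cm, ticks) for cm in range(1, 41)]

print([round(t, 2) for t in ticks])
print(scores)
```

The ticks crowd together at short lengths and spread out at long ones, so the first 10 cm take up most of the raw-score units while the last 10 cm add very few: the ordinal units are a curvilinear function of cm.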

35 [Figure: Centimetre "objects" vs log(cm) ordinal units. Axes: centimetres (cm); log(cm) with imposed ordinal "ticks".]

36 The model fits almost perfectly (chi-square probability ~ 0.97). All but 1 item fit the model (via chi-square). The Rasch item location/difficulty parameters correlate with the bad-ruler units. Once again, the Rasch model is seen to attempt to linearise the 16 items and 40 length measures - and it succeeds well. BUT, those item units were mapped onto extensive units of measurement, using a logarithmic concatenation of cm to bad-ruler units.

37 [Figure: Rasch difficulty and ordinal units against actual cm measurement. Axes: a scale for Rasch measures and the raw score units; measurement in actual cm. Series: item difficulty; raw score units.]

38 [Figure: Rasch difficulty parameters vs the ordinal rank units. Axes: the Rasch difficulty measure for each item "unit"; the ordinal "bad-ruler" units.]

39 Critical Point: Given that the deterministic axioms of Luce and Tukey's (1964) simultaneous conjoint measurement hold for this form of probabilistic model, then Rasch scaling is producing equal-interval, additive units of measurement ... but of what exactly? By demonstrating its virtual identity with the raw scores and the ranked (but equal-interval) ordinal units, I am led to conclude it is producing an equal-interval scaling of the numerals representing the ranked items.

40 Critical Point: Given my mapped units were equal-interval, it is then no surprise that the Rasch locations and rank-unit locations are so closely related in a linear function. But the key issue is that the real unit of measurement (cm) was never exposed by the model. It is this result that causes me to question the automatic use of the model for psychological science investigations.

41 Critical Point: My use of the Rasch model here seems to be dependent upon some form of inductive logic - that is, I use the model to determine a unit of unknown meaning. But surely science proceeds by first defining a meaningful (in some theoretical sense) unit, and then designing measurement to determine whether that unit functions in the manner specified by some theory?

42 Critical Point: Thus, if we are to use the Rasch model productively, it cannot be used as an inductive unit-generating procedure, but rather as part of a hypothetico-deductive process of investigation.

43 So, might we conclude that the Rasch model is of more value to scientific investigation than True-Score Theory? On balance, YES. Given that a quantitative science requires equal-interval, additive units of measurement for variables, then there really is no alternative to the Rasch model for psychological measurement. However, we have also seen that scaling, in the absence of theory for the fundamental unit of measurement for a latent variable, is not of value except perhaps pragmatically.

44 Four clear, justifiable, and pragmatic reasons to use the Rasch Model, given fit of the model to your data: (1) an explicit ordering of items as a cumulative response scale, on a common linear metric shared with person measures; (2) additive units of measurement; (3) computation of both item and person reliability; (4) computation of the location-sensitive standard error-of-measurement/information function over the range of the latent variable measures.

45 Conclusions.1 ❶ The Rasch Model is dangerous for the wrong reasons! Whereas I was thinking of it as a means to assist in the development of fundamental standard units of measurement for latent variables, this is not possible without first having a model for what and how these units should be instantiated within a deductive theoretical framework.

46 Conclusions.2 ❷ The problem remains with our conception of the constituent properties of latent variables - and their proposed units of measurement. ❸ The Rasch model provides more information about any test respondent, and test items, than does CTT. For pragmatic purposes alone, this is surely of benefit to the applied psychological professions, both test developers and test users.

47 Conclusions.3 ❹ CERTAIN domains of psychometric tests do have good validity - the recent paper by Schmidt and Hunter (Sept. 1998) in the Psychological Bulletin demonstrates this clearly, but also demonstrates the poverty of scientific understanding in this area.

Schmidt, F.L. and Hunter, J.E. (1998) The Validity and Utility of Selection Methods in Personnel Psychology: Practical and Theoretical Implications of 85 years of research findings. Psychological Bulletin, 124, 2,

48 Conclusions.4 ❺ If we question the use of the Rasch model, on the basis of the stability of the constructs we are measuring (because of environmental or situational factors that may change over time), then we are in fact NOT questioning the Rasch model at all, but the very rules and meaning by which we are proposing to instantiate our constructs. To spurn the possibility of equal-interval measurement on this basis is quite wrong. Rather, we need to consider the conceptual status of what it is that we think we are measuring.

49 Addendum.1 - 12/10/99. Paul Barrett: Addendum October 1999

Following this presentation in November 1998, Ben Wright from the MESA group at Chicago re-analysed my data and concluded that there was insufficient stochasticity (random error) in my observations. In short, my data may have been artificially too clean for the Rasch model to fit well (as it is a form of probabilistic conjoint scaling, not deterministic). I am not happy with the implications of this, although I see exactly the veracity of his argument.

I think our disagreement may lie in the fact that others (like William Fisher Jnr.) see the meaning-measurement unit linkage as a construction that is created after Rasch measurement is created. For me, this is the wrong way round. You first need to define the meaning, then develop the measurement, based upon the conceptualisation of a meaningful unit. Simply scaling a set of items (aka arithmetic items), then deciding that the equal-interval unit can be known as arithmets, is acceptable if the only purpose of the measurement is pragmatic, but not acceptable if the aim of this work is to make statements about the magnitude of some psychological attribute/process that underlies an individual's ability to solve arithmetic items.

50 Addendum.2 - 12/10/99

So, I am now setting up a better data-generation program that gives me greater control over the amount of error I introduce, and the amount of ordinality in my measurement ruler. Then, I intend to produce some better tests of my propositions. Further, as David Andrich pointed out, he does not use the chi-square statistic as a measure of model or item fit, but rather uses a mixture of graphical, tabular, and other data to examine the issue of model fit.

However, I note that George Karabatsos (gkarab@lsumc.edu) is working on model fit from the perspective of additive conjoint measurement (ACM). He noted in a recent message: "At face value, this approach to correspond goodness of fit with the ACM axioms may seem convenient and sensible. But the goodness of fit stats are hampered by several issues. As I found out in my dissertation work, and later in several sources (please refer to the bottom of this message), the correspondence between goodness of fit statistics and measurement axioms is far from perfect. For instance, the Rasch model can conclude perfect fit, while the axiom tests reveal significant departures from the measurement model." The rest of the message was almost as bleak for me! (Rasch listserv rasch@acer.edu.au message date...27/09/99). Anyway, time for some more detailed exploration into this measurement model.
