Item Response Theory and Computerized Adaptive Testing
1 Item Response Theory and Computerized Adaptive Testing Richard C. Gershon, PhD Department of Medical Social Sciences Feinberg School of Medicine Northwestern University May 20, 2011
2 Item Response Theory versus Classical Test Theory Uses of IRT Item Banking Short Forms Computerized Adaptive Tests
3 Psychometrics is the field of study concerned with the theory and technique of educational and psychological measurement, which includes the measurement of knowledge (achievement), abilities, attitudes, and personality traits. Source: Wikipedia
4 Measurement requires the concept of an underlying trait that can be expressed in terms of more or less. Test items are the operational definition of the underlying trait. Test items can be ordered from easy to hard; test takers can be ordered from less able to more able.
5 A latent trait is an unobservable latent dimension that is thought to give rise to a set of observed item responses. [Example item: "I am too tired to do errands" (False/True), located on a continuum from low to severe fatigue.]
6 Latent Traits (cont'd) These latent traits (constructs, variables, θ) are measured on a continuum of severity. [Example item: "I am too tired to do errands?" (False/True), located on a continuum from energetic to severe fatigue.]
7 Equal Interval Measure Test-takers and items are represented on the same scale Item calibrations are independent of the test-takers used for calibration Candidate ability estimates are independent of the particular set of items used for estimation Measurement precision is estimated for each person and each item
8 Item: Difficulty = Severity = Measure = Theta = Item Calibration = Location. Person: Ability = Measure = Theta = Person Calibration = Location.
9 Test-takers and Items are Represented on the Same Scale [Diagram: item difficulty runs from easy to hard, and person QOL runs from low to high, on the same scale.]
10 Discrimination = the degree to which an item discriminates person ability Item Information = the area where an item discriminates Test Information = the area where the test discriminates
11 Item Parameters = IRT statistics about an item Primary: Item Difficulty Often: Item Discrimination Sometimes: Guessing Lots of other ugly looking numbers
12 Differential Item Functioning (DIF) Does an item have different item parameters for different subgroups? Gender Race Age Disease
13 Rasch model one parameter logistic model (1PL) Two parameter logistic model (2PL) Three parameter logistic model (3PL)
14 How to choose an appropriate IRT Model OR My religion is better than your religion!
15 You are about to see mathematical formulas!
16 One Parameter Logistic Model

P = e^(ability − difficulty) / (1 + e^(ability − difficulty))

When the difficulty of a given item exactly matches the examinee's ability level, the person has a 50% chance of answering that item correctly:

P = e^0 / (1 + e^0) = 1/2 = .50
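The formula on this slide is easy to verify in code. A minimal sketch (the function name is illustrative, not part of the slides):

```python
import math

def p_correct_1pl(theta, b):
    """One-parameter logistic (Rasch) probability of a correct response:
    exp(theta - b) / (1 + exp(theta - b))."""
    return math.exp(theta - b) / (1 + math.exp(theta - b))

# When ability exactly matches item difficulty, the probability is .50:
print(p_correct_1pl(theta=0.0, b=0.0))  # 0.5
```

The probability rises smoothly above .50 as ability exceeds difficulty and falls below .50 as difficulty exceeds ability.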
17 Only option for small sample sizes. Often the real model underlying a test labeled as three-parameter. Less costly. The simple solution is always the best.
18 Two Parameter Logistic Model

P = e^(a(ability − b)) / (1 + e^(a(ability − b)))

Two parameters: a = discrimination, b = item difficulty
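A sketch of the 2PL, extending the 1PL formula; a higher discrimination a makes the curve steeper around b (names are illustrative):

```python
import math

def p_correct_2pl(theta, a, b):
    """Two-parameter logistic model: a = discrimination, b = item difficulty."""
    return math.exp(a * (theta - b)) / (1 + math.exp(a * (theta - b)))

# At theta == b the probability is .50 regardless of a; away from b,
# a more discriminating item separates examinees more sharply:
print(p_correct_2pl(1.5, a=0.5, b=0.5))
print(p_correct_2pl(1.5, a=2.5, b=0.5))
```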
19 [ICC plots comparing discrimination: a = .5, 1.5, 2.5, each with b = .5, c = .1]
20 Three Parameter Logistic Model

P = c + (1 − c) · e^(a(ability − b)) / (1 + e^(a(ability − b)))

Three parameters: a = discrimination, b = item difficulty, c = guessing
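The 3PL adds a lower asymptote for guessing. A sketch (names are illustrative):

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Three-parameter logistic model: c is the lower asymptote (guessing)."""
    logistic = math.exp(a * (theta - b)) / (1 + math.exp(a * (theta - b)))
    return c + (1 - c) * logistic

# Even a very low-ability examinee retains roughly a c-level chance of success:
print(p_correct_3pl(theta=-5.0, a=1.5, b=0.5, c=0.2))  # just above 0.2
```

At theta = b the probability is c + (1 − c)/2, not .50, which is one reason 3PL item parameters are harder to interpret.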
21 Requires a large sample size. Significant research demonstrates that 3PL is theoretically better but in practice has little advantage over 1PL. Most accepted theoretical model.
22 [ICC plots comparing parameter sets: a = 1.5, b = .5, c = .1 versus a = 2.5, b = .5, c = .25]
23 One parameter: Rating Scale Model, Partial Credit Model. Two parameter: Graded Response Model, Generalized Partial Credit Model.
24 There are also IRT models which consider more than one unidimensional trait at a time
25 How does IRT differ from conventional test theory?
26 An individual takes an assessment. Their total score on that assessment is used for comparison purposes. High score: the person is higher on the trait. Low score: the person is lower on the trait.
27 Each individual item can be used for comparison purposes. Person endorses a better rating on hard items: the person is higher on the trait. Person endorses a worse rating on easy items: the person is lower on the trait. Items that measure the same construct can be aggregated into longer assessments.
28 CTT: Reliability is based upon the total test. Regardless of patient ability, reliability is the same. IRT: Reliability is calculated for each patient ability and varies across the continuum. Typically, there is better reliability in the middle of the distribution.
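The IRT side of this contrast follows from information: the standard error of measurement at a given ability is 1 / sqrt(test information at that ability), so precision genuinely varies along the continuum. A sketch under the 2PL (the item parameters here are invented for illustration):

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1 / (1 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2PL item information a^2 * P * (1 - P); it peaks where theta == b."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1 - p)

def standard_error(theta, items):
    """SE of measurement = 1 / sqrt(sum of item informations at theta)."""
    return 1 / math.sqrt(sum(item_information(theta, a, b) for a, b in items))

# A test whose items cluster near the middle is most precise there:
items = [(1.5, -0.5), (1.5, 0.0), (1.5, 0.5)]
print(standard_error(0.0, items) < standard_error(3.0, items))  # True
```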
29 CTT: Validity is based upon the total test. Typically, validity would need to be re-assessed if the instrument is modified in any way. IRT: Validity is assessed for the entire item bank. Subsets of items (full-length tests, short forms, and CAT) all inherit the validity assessed for the original item bank.
30 How Scores Depend on the Difficulty of Test Items [Figure: the same person's expected score on a very easy test, a very hard test, and a medium test.] Reprinted with permission from: Wright, B.D. & Stone, M. (1979) Best test design, Chicago: MESA Press, p. 5.
31 Raw Scores vs. IRT Measures IRT has equal-interval measurement. [Figure: raw scores on a 4-item test mapped onto logit measures.]
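One way to see the unequal spacing: for a Rasch-scaled test whose items all sit at difficulty 0, the measure for raw score r out of n is ln(r / (n − r)). A sketch under that simplifying assumption:

```python
import math

def raw_to_logit(r, n):
    """Rasch logit measure for raw score r on an n-item test with all items
    at difficulty 0; extreme scores (0 and n) have no finite measure."""
    return math.log(r / (n - r))

# On a 4-item test, equal one-point raw gaps are unequal in logits:
for r in (1, 2, 3):
    print(r, round(raw_to_logit(r, 4), 2))
```

Middle raw scores sit close together in logits while scores near the extremes spread out, which is exactly why raw totals are not equal-interval.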
32 How Differences Between Person Ability and Item Difficulty Ought to Affect the Probability of a Correct Response [Figure: when person ability exceeds item difficulty, p > .5; when ability falls below difficulty, p < .5; when they are equal, p = .5.] Reprinted with permission from: Wright, B.D. & Stone, M. (1979) Best test design, Chicago: MESA Press, p. 13.
33 The Item Characteristic Curve
34 I Have a Lack of Energy Traditional Test Theory 0 = Very Much 1 = Quite a Bit 2 = Somewhat 3 = A Little Bit 4 = Not at All
35 I Have a Lack of Energy Traditional Test Theory 0 = Very Much 1 = Quite a Bit 2 = Somewhat 3 = A Little Bit 4 = Not at All Item Response Theory
36 The IRT Reality of a 10 Point Rating-Scale Item [Figure: a 10-point scale anchored from "No Pain" to "Worst Pain".]
37 Probability Curve: "I have a lack of energy" This is an Item Characteristic Curve (ICC) for a rating scale item (each option has its own curve), plotted against the trait measure. 0 = Very Much; 1 = Quite a Bit; 2 = Somewhat; 3 = A Little Bit; 4 = Not at All
43 Probability of response by category: None of the time, A little of the time, Some of the time, Most of the time, All of the time, plotted along the fatigue continuum (θ) from energetic to severe fatigue.
44 Probability of response by category: All of the time, Most of the time, Some of the time, A little of the time, None of the time, plotted along the fatigue continuum (θ) from energetic to severe fatigue.
45 Probability of response by category: None of the time, A little of the time, Some of the time, All of the time, Most of the time, plotted along the fatigue continuum (θ) from energetic to severe fatigue.
47 Short Forms: 5-7 items in each HRQL area, constructed to cover the full range of the trait, OR multiple forms constructed to cover only a narrow range of the trait (e.g., high, medium, or low). Computerized Adaptive Testing (CAT): custom individualized assessment, suitable for clinical use, accuracy level chosen by researcher. [Diagram: an item bank covering Emotional Distress, Pain, and Physical Function feeds custom item selection for 3 diseases and 3 trials, yielding 3 unique instruments (prostate cancer, breast cancer, brain tumor), each based on the content interests of individual researchers.] Source: Expert Rev. of Pharmacoeconomics & Outcomes Res. (2003)
48 5-7 items in each HRQL area. Constructed to cover the full range of the trait, OR multiple forms constructed to cover only a narrow range of the trait (e.g., high, medium, or low). [Diagram: an item bank covering Emotional Distress, Pain, and Physical Function.] Source: Expert Rev. of Pharmacoeconomics & Outcomes Res. (2003)
49 [Diagram: a Depressive Symptoms item bank spanning no depression, mild, moderate, severe, and extreme depression, from which three short forms (A, B, C) are drawn, each targeting a different severity range.]
50 Custom individualized assessment. Suitable for clinical use. Accuracy level chosen by researcher. [Diagram: an item bank covering Emotional Distress, Pain, and Physical Function.] Source: Expert Rev. of Pharmacoeconomics & Outcomes Res. (2003)
51 [Diagram: custom item selection from the item bank for 3 diseases and 3 trials yields 3 unique instruments (prostate cancer, breast cancer, brain tumor), each based on the content interests of individual researchers.] Source: Expert Rev. of Pharmacoeconomics & Outcomes Res. (2003)
52 Create a standard static instrument. Construct short forms. Enable CAT. Select items based on unique content interests and formulate custom short-form or full-length instruments.
53 In every case, using a validated, pre-calibrated item bank allows any of these instruments to be pre-validated and produce standardized scores on the same scale
54 Computerized Adaptive Testing
55 What is Computerized Adaptive Testing? Shorter Targeting Computerized Algorithm
56 Armed Services Vocational Aptitude Battery (ASVAB)
59 ACCUPLACER OnLine
63 [Illustration: items administered around the pass point; after a mix of correct and incorrect responses, the ability estimate converges above the pass point: PASS!]
64 [Illustration: items administered around the pass point; after a mix of correct and incorrect responses, the ability estimate converges below the pass point: FAIL]
65 Binary search
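The analogy can be made literal: a toy CAT that moves the difficulty of the next item up or down and halves its step after each response behaves like a binary search over the trait continuum (everything here is a simplified illustration, not a production algorithm):

```python
def toy_cat(respond, low=-4.0, high=4.0, n_items=6):
    """Toy adaptive test: administer each item at the current ability
    estimate, step up on a correct response and down on an incorrect one,
    halving the step each time, binary-search style."""
    theta, step = 0.0, (high - low) / 4
    for _ in range(n_items):
        correct = respond(theta)   # administer an item of difficulty ~= theta
        theta += step if correct else -step
        step /= 2                  # narrow the bracket after every response
    return theta

# A deterministic examinee whose true ability is 1.2
# (answers correctly whenever the item sits below that point):
estimate = toy_cat(lambda b: b < 1.2)
print(round(estimate, 3))  # lands close to the true 1.2
```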
66 [Illustration: trait continuum from low to high. Result: medium-high on the trait.]
67 Heterogeneous populations Diagnostic tests On-demand testing Long tests
68 Small populations Difficulty in calibrating items Higher administration costs
69 Calibrated item bank Administration software
70 You probably already have done the hard work! It's usually the same as for CBT.
71 Create an Item Bank, Part 1: item sources. Previous exams (with a CAT bank, fewer constraints exist regarding item re-use). Write new items.
72 Create an Item Bank, Part 2: item quality. Statistics relevant to CAT (not necessarily print): difficulty, plus other variables relevant to the selected IRT model.
73 Analyze item level data from previously administered items. Preferably using raw person-data; alternatively, could create raw estimates from p-values. Several software packages exist for this purpose.
74 Pilot (beta) new test items, on paper or on computer. Typically requires a psychometrician to supervise analyses.
75 Starting rule: with the item that provides maximum information, or at the cut point.
76 Stopping rule: fixed length, or variable length by total test/subtest, calculated from a specified precision of measure or a specified confidence in a pass/fail decision, subject to maximum and minimum item counts.
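A variable-length rule of the kind listed above can be sketched as: stop once the standard error of the measure reaches a target, bounded by minimum and maximum item counts (2PL information is assumed; the function names and thresholds are illustrative):

```python
import math

def p_2pl(theta, a, b):
    return 1 / (1 + math.exp(-a * (theta - b)))

def standard_error(theta, administered):
    """SE of measurement = 1 / sqrt(test information at theta)."""
    info = sum(a * a * p_2pl(theta, a, b) * (1 - p_2pl(theta, a, b))
               for a, b in administered)
    return 1 / math.sqrt(info) if info > 0 else float("inf")

def should_stop(theta, administered, se_target=0.3, min_items=4, max_items=20):
    """Variable-length stopping rule: stop once the specified precision is
    reached, but never before min_items or after max_items."""
    n = len(administered)
    if n < min_items:
        return False
    if n >= max_items:
        return True
    return standard_error(theta, administered) <= se_target
```

A pass/fail variant would instead stop when the confidence interval around the measure clears the cut point.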
77 Content balancing: none, or fixed percentage.
78 Testing new items (beta testing, field testing, experimental items)
79 Person ability algorithm. Item selection algorithm: test difficulty, maximum jump size, content issues, item exposure control, option to not allow the same items to be used during retesting, overlapping items (items that cue other items).
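A maximum-information item selection step, with a crude stand-in for the "maximum jump size" constraint, might look like this (the bank and its parameters are invented for illustration):

```python
import math

def p_2pl(theta, a, b):
    return 1 / (1 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    p = p_2pl(theta, a, b)
    return a * a * p * (1 - p)

def select_item(theta, bank, used, max_jump=1.0):
    """Pick the most informative unused item at the current ability estimate,
    refusing difficulty jumps larger than max_jump when possible."""
    candidates = [(i, a, b) for i, (a, b) in enumerate(bank)
                  if i not in used and abs(b - theta) <= max_jump]
    if not candidates:  # fall back to the full unused pool
        candidates = [(i, a, b) for i, (a, b) in enumerate(bank) if i not in used]
    return max(candidates, key=lambda t: item_info(theta, t[1], t[2]))[0]

bank = [(1.0, -1.5), (1.8, 0.0), (0.9, 0.2), (1.2, 2.5)]  # (a, b) pairs
print(select_item(theta=0.0, bank=bank, used=set()))  # picks the (1.8, 0.0) item
```

Real systems temper pure maximum-information selection with exposure control, so the same few highly discriminating items are not shown to every examinee.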
80 [Sample CAT administration report: MLT exam, tested 01/28/02, result: Clear Pass. Each row lists the item administered, content area, difficulty, response, response time, and the running measure and standard error.]
81 [Sample CAT administration report: HT exam, tested 01/28/02, result: Clear Fail.]
82 [Sample CAT administration report: PBT exam, tested 01/26/02, result: Fence Sitter (the measure stays near the cut point).]
83 [CAT simulation screenshots (slides 83-90): information functions for candidate items such as "I have a lack of energy" (0 = Very Much; 1 = Quite a Bit; 2 = Somewhat; 3 = A Little Bit; 4 = Not at All) and F65, F64, F68, together with running item/measure/SE tables as the simulated administration proceeds.]
91 In the past 7 days (Never / Rarely / Sometimes / Often / Always): FATEXP20 How often did you feel tired? FATEXP5 How often did you experience extreme exhaustion? FATEXP18 How often did you run out of energy? FATIMP33 How often did your fatigue limit you at work (include work at home)? FATIMP30 How often were you too tired to think clearly? FATIMP21 How often were you too tired to take a bath or shower? FATIMP40 How often did you have enough energy to exercise strenuously? Reprinted with permission of the PROMIS Health Organization and the PROMIS Cooperative Group 2007.
92 CAT in Assessment Center
More informationEnsemble Rasch Models
Ensemble Rasch Models Steven M. Lattanzio II Metamatrics Inc., Durham, NC 27713 email: slattanzio@lexile.com Donald S. Burdick Metamatrics Inc., Durham, NC 27713 email: dburdick@lexile.com A. Jackson Stenner
More informationADDITIVITY IN PSYCHOLOGICAL MEASUREMENT. Benjamin D. Wright MESA Psychometric Laboratory The Department of Education The University of Chicago
Measurement and Personality Assessment Edw. E. Roskam (ed.) Elsevier Science Publishers B.V. (North-Holland), 198S 101 ADDITIVITY IN PSYCHOLOGICAL MEASUREMENT Benjamin D. Wright MESA Psychometric Laboratory
More informationMultidimensional item response theory observed score equating methods for mixed-format tests
University of Iowa Iowa Research Online Theses and Dissertations Summer 2014 Multidimensional item response theory observed score equating methods for mixed-format tests Jaime Leigh Peterson University
More informationarxiv: v1 [math.st] 7 Jan 2015
arxiv:1501.01366v1 [math.st] 7 Jan 2015 Sequential Design for Computerized Adaptive Testing that Allows for Response Revision Shiyu Wang, Georgios Fellouris and Hua-Hua Chang Abstract: In computerized
More informationSample Size Determination
Sample Size Determination 018 The number of subjects in a clinical study should always be large enough to provide a reliable answer to the question(s addressed. The sample size is usually determined by
More informationWELCOME! Lecture 14: Factor Analysis, part I Måns Thulin
Quantitative methods II WELCOME! Lecture 14: Factor Analysis, part I Måns Thulin The first factor analysis C. Spearman (1904). General intelligence, objectively determined and measured. The American Journal
More informationMonte Carlo Simulations for Rasch Model Tests
Monte Carlo Simulations for Rasch Model Tests Patrick Mair Vienna University of Economics Thomas Ledl University of Vienna Abstract: Sources of deviation from model fit in Rasch models can be lack of unidimensionality,
More informationDraft Proof - Do not copy, post, or distribute
1 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1. Distinguish between descriptive and inferential statistics. Introduction to Statistics 2. Explain how samples and populations,
More informationLecture 10: Alternatives to OLS with limited dependent variables. PEA vs APE Logit/Probit Poisson
Lecture 10: Alternatives to OLS with limited dependent variables PEA vs APE Logit/Probit Poisson PEA vs APE PEA: partial effect at the average The effect of some x on y for a hypothetical case with sample
More informationTwo-sample Categorical data: Testing
Two-sample Categorical data: Testing Patrick Breheny April 1 Patrick Breheny Introduction to Biostatistics (171:161) 1/28 Separate vs. paired samples Despite the fact that paired samples usually offer
More informationA measure for the reliability of a rating scale based on longitudinal clinical trial data Link Peer-reviewed author version
A measure for the reliability of a rating scale based on longitudinal clinical trial data Link Peer-reviewed author version Made available by Hasselt University Library in Document Server@UHasselt Reference
More informationLiang Li, PhD. MD Anderson
Liang Li, PhD Biostatistics @ MD Anderson Behavioral Science Workshop, October 13, 2014 The Multiphase Optimization Strategy (MOST) An increasingly popular research strategy to develop behavioral interventions
More informationA Study of Statistical Power and Type I Errors in Testing a Factor Analytic. Model for Group Differences in Regression Intercepts
A Study of Statistical Power and Type I Errors in Testing a Factor Analytic Model for Group Differences in Regression Intercepts by Margarita Olivera Aguilar A Thesis Presented in Partial Fulfillment of
More informationPower and sample size calculations
Patrick Breheny October 20 Patrick Breheny University of Iowa Biostatistical Methods I (BIOS 5710) 1 / 26 Planning a study Introduction What is power? Why is it important? Setup One of the most important
More informationPIRLS 2016 Achievement Scaling Methodology 1
CHAPTER 11 PIRLS 2016 Achievement Scaling Methodology 1 The PIRLS approach to scaling the achievement data, based on item response theory (IRT) scaling with marginal estimation, was developed originally
More informationThe Difficulty of Test Items That Measure More Than One Ability
The Difficulty of Test Items That Measure More Than One Ability Mark D. Reckase The American College Testing Program Many test items require more than one ability to obtain a correct response. This article
More informationThe Lady Tasting Tea. How to deal with multiple testing. Need to explore many models. More Predictive Modeling
The Lady Tasting Tea More Predictive Modeling R. A. Fisher & the Lady B. Muriel Bristol claimed she prefers tea added to milk rather than milk added to tea Fisher was skeptical that she could distinguish
More informationSTAT 526 Spring Final Exam. Thursday May 5, 2011
STAT 526 Spring 2011 Final Exam Thursday May 5, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points will
More informationMaryland High School Assessment 2016 Technical Report
Maryland High School Assessment 2016 Technical Report Biology Government Educational Testing Service January 2017 Copyright 2017 by Maryland State Department of Education. All rights reserved. Foreword
More informationSequential Analysis of Quality of Life Measurements Using Mixed Rasch Models
1 Sequential Analysis of Quality of Life Measurements Using Mixed Rasch Models Véronique Sébille 1, Jean-Benoit Hardouin 1 and Mounir Mesbah 2 1. Laboratoire de Biostatistique, Faculté de Pharmacie, Université
More informationwhere Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc.
Notes on regression analysis 1. Basics in regression analysis key concepts (actual implementation is more complicated) A. Collect data B. Plot data on graph, draw a line through the middle of the scatter
More informationClassical Test Theory (CTT) for Assessing Reliability and Validity
Classical Test Theory (CTT) for Assessing Reliability and Validity Today s Class: Hand-waving at CTT-based assessments of validity CTT-based assessments of reliability Why alpha doesn t really matter CLP
More informationProblem Set #5. Due: 1pm on Friday, Nov 16th
1 Chris Piech CS 109 Problem Set #5 Due: 1pm on Friday, Nov 16th Problem Set #5 Nov 7 th, 2018 With problems by Mehran Sahami and Chris Piech For each problem, briefly explain/justify how you obtained
More informationA PSYCHOPHYSICAL INTERPRETATION OF RASCH S PSYCHOMETRIC PRINCIPLE OF SPECIFIC OBJECTIVITY
A PSYCHOPHYSICAL INTERPRETATION OF RASCH S PSYCHOMETRIC PRINCIPLE OF SPECIFIC OBJECTIVITY R. John Irwin Department of Psychology, The University of Auckland, Private Bag 92019, Auckland 1142, New Zealand
More informationCenter for Advanced Studies in Measurement and Assessment. CASMA Research Report
Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 24 in Relation to Measurement Error for Mixed Format Tests Jae-Chun Ban Won-Chan Lee February 2007 The authors are
More informationPractice of SAS Logistic Regression on Binary Pharmacodynamic Data Problems and Solutions. Alan J Xiao, Cognigen Corporation, Buffalo NY
Practice of SAS Logistic Regression on Binary Pharmacodynamic Data Problems and Solutions Alan J Xiao, Cognigen Corporation, Buffalo NY ABSTRACT Logistic regression has been widely applied to population
More informationHigher National Unit specification. General information for centres. Unit code: F6JK 35
Higher National Unit specification General information for centres Unit title: Noise and Radioactivity Unit code: F6JK 35 Unit purpose: This Unit provides an understanding of the nature and effects of
More informationDevelopment and Calibration of an Item Response Model. that Incorporates Response Time
Development and Calibration of an Item Response Model that Incorporates Response Time Tianyou Wang and Bradley A. Hanson ACT, Inc. Send correspondence to: Tianyou Wang ACT, Inc P.O. Box 168 Iowa City,
More informationIRT Potpourri. Gerald van Belle University of Washington Seattle, WA
IRT Potpourri Gerald van Belle University of Washington Seattle, WA Outline. Geometry of information 2. Some simple results 3. IRT and link to sensitivity and specificity 4. Linear model vs IRT model cautions
More informationUse of Robust z in Detecting Unstable Items in Item Response Theory Models
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to
More informationSampling Distributions: Central Limit Theorem
Review for Exam 2 Sampling Distributions: Central Limit Theorem Conceptually, we can break up the theorem into three parts: 1. The mean (µ M ) of a population of sample means (M) is equal to the mean (µ)
More informationFCE 3900 EDUCATIONAL RESEARCH LECTURE 8 P O P U L A T I O N A N D S A M P L I N G T E C H N I Q U E
FCE 3900 EDUCATIONAL RESEARCH LECTURE 8 P O P U L A T I O N A N D S A M P L I N G T E C H N I Q U E OBJECTIVE COURSE Understand the concept of population and sampling in the research. Identify the type
More informationWhats beyond Concerto: An introduction to the R package catr. Session 4: Overview of polytomous IRT models
Whats beyond Concerto: An introduction to the R package catr Session 4: Overview of polytomous IRT models The Psychometrics Centre, Cambridge, June 10th, 2014 2 Outline: 1. Introduction 2. General notations
More informationOn the Use of Nonparametric ICC Estimation Techniques For Checking Parametric Model Fit
On the Use of Nonparametric ICC Estimation Techniques For Checking Parametric Model Fit March 27, 2004 Young-Sun Lee Teachers College, Columbia University James A.Wollack University of Wisconsin Madison
More informationImmigration attitudes (opposes immigration or supports it) it may seriously misestimate the magnitude of the effects of IVs
Logistic Regression, Part I: Problems with the Linear Probability Model (LPM) Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised February 22, 2015 This handout steals
More information