An Overview of Item Response Theory. Michael C. Edwards, PhD

Similar documents
Basic IRT Concepts, Models, and Assumptions

Item Response Theory and Computerized Adaptive Testing

Latent Trait Reliability

Introduction To Confirmatory Factor Analysis and Item Response Theory

Statistical and psychometric methods for measurement: Scale development and validation

Measurement Invariance (MI) in CFA and Differential Item Functioning (DIF) in IRT/IFA

Anders Skrondal. Norwegian Institute of Public Health London School of Hygiene and Tropical Medicine. Based on joint work with Sophia Rabe-Hesketh

How to Measure the Objectivity of a Test

A Markov chain Monte Carlo approach to confirmatory item factor analysis. Michael C. Edwards The Ohio State University

Walkthrough for Illustrations. Illustration 1

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math.

Introduction to Algebra: The First Week

Lesson 7: Item response theory models (part 2)

Classical Test Theory. Basics of Classical Test Theory. Cal State Northridge Psy 320 Andrew Ainsworth, PhD

Classical Test Theory (CTT) for Assessing Reliability and Validity

Measurement Theory. Reliability. Error Sources. = XY r XX. r XY. r YY

Psychometric Issues in Formative Assessment: Measuring Student Learning Throughout the Academic Year Using Interim Assessments

Using modern statistical methodology for validating and reporti. Outcomes

Development and Calibration of an Item Response Model. that Incorporates Response Time

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Class Introduction and Overview; Review of ANOVA, Regression, and Psychological Measurement

Assessment, analysis and interpretation of Patient Reported Outcomes (PROs)

Comparing Multi-dimensional and Uni-dimensional Computer Adaptive Strategies in Psychological and Health Assessment. Jingyu Liu

Interpreting and using heterogeneous choice & generalized ordered logit models

Statistical and psychometric methods for measurement: G Theory, DIF, & Linking

ABSTRACT. Yunyun Dai, Doctor of Philosophy, Mixtures of item response theory models have been proposed as a technique to explore

Uni- and Bivariate Power

LECTURE 2: SIMPLE REGRESSION I

HOW TO WRITE PROOFS. Dr. Min Ru, University of Houston

An Equivalency Test for Model Fit. Craig S. Wells. University of Massachusetts Amherst. James. A. Wollack. Ronald C. Serlin

A Study of Statistical Power and Type I Errors in Testing a Factor Analytic. Model for Group Differences in Regression Intercepts

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Generalized Linear Models for Non-Normal Data

DIFFERENTIAL EQUATIONS

Regression, Part I. - In correlation, it would be irrelevant if we changed the axes on our graph.

What's beyond Concerto: An introduction to the R package catR. Session 4: Overview of polytomous IRT models

The application and empirical comparison of item. parameters of Classical Test Theory and Partial Credit. Model of Rasch in performance assessments

TESTING AND MEASUREMENT IN PSYCHOLOGY. Merve Denizci Nazlıgül, M.S.

Notes 11: OLS Theorems ECO 231W - Undergraduate Econometrics

Comparing IRT with Other Models

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

The Purpose of Hypothesis Testing

ReleQuant Improving teaching and learning in modern physics in upper secondary school Budapest 2015

Summer School in Applied Psychometric Principles. Peterhouse College 13 th to 17 th September 2010

Math 440 Project Assignment

1. THE IDEA OF MEASUREMENT

Chapter 26: Comparing Counts (Chi Square)

Item Response Theory (IRT) Analysis of Item Sets

SPIRITUAL GIFTS. ( ) ( ) 1. Would you describe yourself as an effective public speaker?

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

Logistic Regression and Item Response Theory: Estimation Item and Ability Parameters by Using Logistic Regression in IRT.

Overview. Multidimensional Item Response Theory. Lecture #12 ICPSR Item Response Theory Workshop. Basics of MIRT Assumptions Models Applications

Business Statistics. Lecture 9: Simple Regression

Modeling Log Data from an Intelligent Tutor Experiment

(a)(b + c) = ab + ac, but it can also be (b + c)(a) = ba + ca. Let's use the distributive property on a couple of

A Use of the Information Function in Tailored Testing

Because it might not make a big DIF: Assessing differential test functioning

Structural Equation Modeling and Confirmatory Factor Analysis. Types of Variables

Student Questionnaire (s) Main Survey

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Harvard University. Rigorous Research in Engineering Education

Section 4. Test-Level Analyses

Correlation and Regression Bangkok, 14-18, Sept. 2015

CS 361: Probability & Statistics

Logistic Regression: Regression with a Binary Dependent Variable

Paul Barrett

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

Likelihood and Fairness in Multidimensional Item Response Theory

Analysis of students self-efficacy, interest, and effort beliefs in general chemistry

WU Weiterbildung. Linear Mixed Models

Ch. 16: Correlation and Regression

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Bayesian Nonparametric Rasch Modeling: Methods and Software

Joint Modeling of Longitudinal Item Response Data and Survival

Application of Item Response Theory Models for Intensive Longitudinal Data

Midterm 1 ECO Undergraduate Econometrics

What is an Ordinal Latent Trait Model?

CS 124 Math Review Section January 29, 2018

UCLA Department of Statistics Papers

Interactions among Continuous Predictors

Univariate analysis. Simple and Multiple Regression. Univariate analysis. Simple Regression How best to summarise the data?

36-720: The Rasch Model

Overview. Overview. Overview. Specific Examples. General Examples. Bivariate Regression & Correlation

LINKING IN DEVELOPMENTAL SCALES. Michelle M. Langer. Chapel Hill 2006

What Is Multiple Regression Good For?

review session gov 2000 gov 2000 () review session 1 / 38

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Bayesian Estimation An Informal Introduction

Introduction to Confirmatory Factor Analysis

Paradoxical Results in Multidimensional Item Response Theory

Path Analysis. PRE 906: Structural Equation Modeling Lecture #5 February 18, PRE 906, SEM: Lecture 5 - Path Analysis

Linking the Smarter Balanced Assessments to NWEA MAP Tests

Euclidean Geometry. The Elements of Mathematics

ED 357/358 - FIELD EXPERIENCE - LD & EI LESSON DESIGN & DELIVERY LESSON PLAN #4

PIRLS 2016 Achievement Scaling Methodology 1

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Analysing categorical data using logit models

Conditional Standard Errors of Measurement for Performance Ratings from Ordinary Least Squares Regression

Binary Logistic Regression

Transcription:

An Overview of Item Response Theory Michael C. Edwards, PhD

Overview
- General overview of psychometrics
- Reliability and validity
- Different models and approaches
- Item response theory (IRT)
- Conceptual framework
- Operational basics
- Model variants
- Advanced topics
- Questions

Psychometrics According to Guilford, who wrote one of the first (if not the first) textbooks on psychometrics in 1936, psychometric methods are procedures for psychological measurement. Borrowing from N.R. Campbell (1940), Guilford defines measurement as the assignment of numerals to objects or events according to rules. This idea was further popularized by Stevens (1946 & 1951) in his writings about measurement. As a field of study, modern psychometrics is focused on mathematical and statistical models that relate observable data to unobservable constructs.

Measurement The basic idea of testing, generally speaking, is that we can obtain an estimate of an individual's level of some trait/construct of interest by collecting a (relatively) small sample of data relevant to that trait. In educational testing this may involve determining someone's math proficiency by asking them to answer 25 math items. In patient reported outcomes this may involve determining someone's quality of life by asking them to respond to 25 quality of life (QoL) items.

Psychometrics Psychometric models have roots reaching back to early psychophysical experiments in the 1800s as well as some of the foundational work in statistics in the early 1900s.
- Factor analysis is well over 100 years old (Spearman, 1904)
- True score theory is over 80 years old (Thurstone, 1931)
- IRT is over 60 years old (Thurstone, 1953)
- Generalizability theory is over 50 years old (Cronbach et al., 1963)
- Computerized adaptive testing has been occurring operationally since the early 1980s

Psychometrics Core Concepts
Validity: measuring the right thing. "Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment." (Messick, 1993, p. 13, emphasis in original)
Reliability: measuring the thing right. Specifically, reliability relates to the consistency of scores over replications of the measurement procedure (Brennan, 2006, p. 9).

Psychometrics Approaches
- Classical Test Theory (CTT): true score theory; coefficient alpha
- Generalizability Theory: CTT + ANOVA
- Latent Variable Models: factor analysis, IRT; structural equation modeling; latent growth curve modeling

Item Response Theory (IRT)

                      Continuous Outcome    Discrete Outcome
Observed Predictor    OLS Regression        Logistic Regression
Latent Predictor      Factor Analysis       Item Response Theory

IRT IRT is closely related to (or a special case of): factor analysis, structural equation modeling, non-linear mixed models, generalized linear models, and logistic regression (or probit regression). Described generally, IRT is a collection of latent variable models that explain the process by which individuals respond to items, in terms of properties related to the person, properties related to the items, and how they interact.

A caveat "Models, of course, are never true, but fortunately it is only necessary that they be useful." (Box, 1979, p. 2) Our contention is not that item response theory is right (it's not), but that it is a useful model for creating and scoring scales.

IRT One of the most widely used IRT models, the 2PLM, is appropriate for dichotomous responses:

$$P(u_j = 1 \mid \theta) = \frac{1}{1 + \exp[-a_j(\theta - b_j)]}$$

where $u_j$ is the observed response to item j, $a_j$ is the slope for item j, $b_j$ is the location parameter for item j, and $\theta$ is the latent construct being measured.
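As a concrete illustration of the trace line this equation defines, here is a minimal Python sketch (assuming numpy is available; the function name prob_2pl and the parameter values are illustrative, not taken from the slides):

```python
import numpy as np

def prob_2pl(theta, a, b):
    """2PLM probability of endorsing item j (u_j = 1) given theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Illustrative parameter values (not from the slides)
theta_grid = np.linspace(-3, 3, 7)
print(prob_2pl(theta_grid, a=1.5, b=0.0))   # steeper item centered at theta = 0
print(prob_2pl(theta_grid, a=0.8, b=1.0))   # flatter item, harder to endorse
```

Plotting these probabilities against theta produces trace lines like those shown on the next few slides.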

Model assumptions: local independence; logistic item response function; normal trait distribution.

2PLM trace lines [Figure: a trace line dividing the trait continuum into regions labeled "Less Pain Needed" (16%) and "More Pain Needed" (84%)]

2PLM trace lines [Figure: trace lines contrasting regions of more reliable information and less reliable information]

2PLM trace lines

Many other models There are dozens, if not more, of other kinds of IRT models. Of particular relevance for the PRO world are models which extend the 2PLM to accommodate more than two categories. One popular IRT model designed for more than two ordered categories is the graded response model (GRM).
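To make the GRM concrete, here is a minimal sketch of its category trace lines under the usual cumulative-logit formulation; the function name grm_category_probs, the slope, and the threshold values are illustrative assumptions, not values from the talk:

```python
import numpy as np

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for the graded response model.

    thresholds: ordered b_1 < b_2 < ... < b_{K-1}; returns K probabilities per theta.
    """
    theta = np.atleast_1d(theta)
    # Cumulative probabilities P(u >= k | theta) for k = 1..K-1
    cum = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - np.asarray(thresholds)[None, :])))
    # Pad with P(u >= 0) = 1 and P(u >= K) = 0, then take differences
    ones = np.ones((theta.size, 1))
    zeros = np.zeros((theta.size, 1))
    cum_full = np.hstack([ones, cum, zeros])
    return cum_full[:, :-1] - cum_full[:, 1:]

# Illustrative four-category item ("Strongly Disagree" ... "Strongly Agree")
probs = grm_category_probs(np.linspace(-3, 3, 5), a=1.8, thresholds=[-1.0, 0.2, 1.3])
print(probs.round(3))   # each row sums to 1 across the four categories
```

Each row of the returned matrix gives one point on each category's trace line, which is what the GRM figure on the next slide displays.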

GRM trace lines [Figure: category trace lines for the item "I often cry out in pain." with response options Strongly Disagree, Disagree, Agree, Strongly Agree]

IRT and Rasch Models There is really no such thing as a Rasch model. They are all IRT models. Some satisfy the requirements of the Rasch philosophy of measurement; others do not. The ones that do are special cases of more general models. With dichotomous data, the 1PL is obtained by constraining all slopes to be equal. This is a nested model, and any constraints of this kind can be tested. If it fits, fine. Life is easier with the 1PL when it works. The Rasch philosophy of measurement specifies the model in advance and, if the data do not fit, the misfitting data are removed. This is troubling for lots of reasons, but the most obvious is the potential detrimental impact on validity.
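In the notation of the earlier 2PLM slide, the equal-slopes constraint described above can be written as follows (a restatement of the slide, not a new result):

$$\text{2PLM: } P(u_j = 1 \mid \theta) = \frac{1}{1 + \exp[-a_j(\theta - b_j)]}$$

$$\text{1PL (nested special case, } a_j = a \text{ for all } j\text{): } P(u_j = 1 \mid \theta) = \frac{1}{1 + \exp[-a(\theta - b_j)]}$$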

IRT Research and experience with IRT over the past 30 years have demonstrated a number of theoretical and practical advantages over other modeling approaches. These include: scale development, scoring, differential item functioning (DIF), and computerized adaptive testing (CAT).


Scale development The kinds of information we can extract from an IRT analysis allow us to be much more specific about what we want a scale to do. This is a good thing, but it requires us to think much more carefully about what we're trying to accomplish with a given scale: Who are we trying to measure? How reliable do our scores need to be at different levels of the construct? How small a change are we interested in being able to detect? What is the primary purpose of the scores?

Information and standard errors

Information and standard errors

95% confidence intervals

95% confidence intervals
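The figures summarized on the preceding slides follow from a few standard 2PLM facts: item information is a_j^2 * P_j(theta) * (1 - P_j(theta)), test information is the sum over items, and the conditional standard error is 1 / sqrt(information). Here is a minimal sketch (the item parameters are illustrative assumptions, not the values behind the slides):

```python
import numpy as np

def item_info_2pl(theta, a, b):
    """Fisher information of one 2PLM item at theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

# Illustrative (not from the slides) five-item scale
a = np.array([1.8, 1.2, 0.9, 1.5, 2.0])
b = np.array([-1.0, -0.3, 0.0, 0.8, 1.5])

theta_hat = 0.4                                    # an estimated score
test_info = sum(item_info_2pl(theta_hat, aj, bj) for aj, bj in zip(a, b))
se = 1.0 / np.sqrt(test_info)                      # conditional standard error
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(round(test_info, 2), round(se, 2), ci)       # information, SE, 95% CI
```

Computing theta_hat plus or minus 1.96 times the SE across the range of theta is what the confidence-interval slides display.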

Scale development Another nice feature of IRT is that it provides a solid statistical framework for future modifications of existing item banks. This could occur in the form of adding new items, revising existing items, revising the response format, translations, or changes in mode of administration.

IRT Research and experience with IRT over the past 30 years have demonstrated a number of theoretical and practical advantages over other modeling approaches. These include: scale development, scoring, differential item functioning (DIF), and computerized adaptive testing (CAT).

Scoring If you're comfortable with the idea that some items are harder to endorse than others, then you're already seeing one of the most important benefits of using IRT for scoring. Summed scores assume equal weighting. That is, saying yes to one item (or choosing "Strongly Agree") is the same as saying yes to any other item. If you believe items differ in their difficulty/severity, then this doesn't make sense.

Scoring IRT-based scoring uses the item parameters to weight each response based on the properties of that particular item. Each item contributes information to create an overall score. Each item response provides information about where an individual is likely to be situated on the latent construct being measured. We aggregate over all items (with data) to come up with a score estimate for each individual. This aggregation is more complex than just adding things up, but it provides a number of advantages. If you're comfortable with unit-specific treatment effect estimates in random effects modeling of multi-site randomized trials, those Empirical Bayes estimates of random effects are statistical cousins of IRT scaled scores.
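As a sketch of what this aggregation looks like in practice, the following computes an expected a posteriori (EAP) score, a close relative of the MAP scores shown in the CBCL example below; the function name, integration grid, and item parameters are illustrative assumptions, not values from the talk:

```python
import numpy as np

def prob_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_score(responses, a, b, grid=np.linspace(-4, 4, 81)):
    """EAP estimate of theta for one 0/1 response pattern under the 2PLM.

    Weights a standard normal prior by the likelihood of the responses and
    returns the posterior mean and posterior standard deviation.
    """
    prior = np.exp(-0.5 * grid**2)            # N(0, 1) prior, unnormalized
    like = np.ones_like(grid)
    for u_j, a_j, b_j in zip(responses, a, b):
        p = prob_2pl(grid, a_j, b_j)
        like *= p if u_j == 1 else (1.0 - p)
    post = prior * like
    post /= post.sum()
    mean = np.sum(grid * post)
    sd = np.sqrt(np.sum((grid - mean) ** 2 * post))
    return mean, sd

# Illustrative parameters: same summed score, different scaled scores
a = np.array([1.5, 1.5, 0.8, 2.0])
b = np.array([-1.0, 0.0, 0.5, 1.5])
print(eap_score([1, 1, 0, 0], a, b))   # endorsed the two "easier" items
print(eap_score([0, 0, 1, 1], a, b))   # endorsed the two "harder" items
```

The two response patterns have the same summed score (2 of 4) but different scaled scores, because the endorsed items differ in severity.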

Scoring
- More valid representation of the construct
- Known metric, easily interpretable
- More variable than summed scores (more on this shortly)
- Conditional standard errors
- Assuming some linking/equating is done, IRT scale scores from different instruments can be put on the same metric

CBCL scoring example (Item 1: Argues; Item 5: Lying; Item 7: Destroys Things)

Response pattern (items 1-8)    f    p+     MAP    se
1 0 0 0 1 0 0 0                19    0.25   0.18   0.55
1 0 1 0 0 0 0 0                 1    0.25   0.21   0.54
1 0 0 0 0 1 0 0                 4    0.25   0.32   0.51
1 1 0 0 0 0 0 0                18    0.25   0.40   0.49
0 1 0 0 0 0 1 0                 1    0.25   0.51   0.47
1 0 0 1 0 0 0 0                 7    0.25   0.52   0.46
1 0 0 0 0 0 1 0                 6    0.25   0.54   0.46

Every pattern shown here has the same summed score (2 of 8 items endorsed, p+ = 0.25), yet the MAP scaled scores and their standard errors differ depending on which items were endorsed.


IRT Research and experience with IRT over the past 30 years have demonstrated a number of theoretical and practical advantages over other modeling approaches. These include: scale development, scoring, differential item functioning (DIF), and computerized adaptive testing (CAT).

DIF Differential item functioning, also known as a lack of measurement invariance, occurs when the nature of an item's relation to the latent construct(s) varies as a function of (usually) some known grouping variable. Classic examples include gender, ethnicity, language, and mode of administration. Methods exist to look for DIF and, if it is found, to model it. This can allow for a statistically rigorous way to answer questions like: Is the Spanish version behaving the same way as the English version? Is the electronic version behaving the same way as the paper-and-pencil version? Is the version on the iPhone 5S behaving the same way as the version on the Galaxy S3? Are mothers and fathers behaving the same way when they respond to this scale?

DIF example What if a new medicine improved one particular symptom on a PRO scale, but nothing else? If we use control vs. experimental as the grouping variable, this would show up as DIF in the b-parameter of a 2PLM. This means that, conditional on having the same theta estimate (i.e., the same level of the latent construct), individuals in one group behave differently on that item than individuals in another group. Here, we would see the item become a less severe indicator of the construct being measured. We can have confidence that it isn't an overall improvement in the latent trait, as this is being controlled for (i.e., the patterns we would expect to see in the data for these two possible hypotheses are different and discernible).
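A minimal numeric sketch of what b-parameter DIF means (all values are illustrative assumptions, not estimates from any study):

```python
import numpy as np

def prob_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Illustrative: the item's slope is the same in both groups, but the location
# (b) shifts for the experimental group -- b-parameter DIF, i.e., the item is
# easier to endorse at the same level of the latent construct.
theta = 0.0
a = 1.4
b_control, b_experimental = 0.5, -0.2

p_control = prob_2pl(theta, a, b_control)
p_experimental = prob_2pl(theta, a, b_experimental)
print(round(p_control, 3), round(p_experimental, 3))
# Conditional on the same theta, the experimental group is more likely to
# endorse this one symptom item -- the signature of DIF rather than an
# overall change in the latent trait.
```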

IRT Research and experience with IRT over the past 30 years have demonstrated a number of theoretical and practical advantages over other modeling approaches. These include: scale development, scoring, differential item functioning (DIF), and computerized adaptive testing (CAT).

CAT Widely used in educational testing for many years, and more recently seen in the PRO space (PROMIS). Items are administered in an adaptive fashion, for example based on the current estimate of the latent construct and on which item is most informative at that estimate. This is the fastest way to reach a target reliability.
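A minimal sketch of the item-selection step at the heart of a CAT, using maximum information under the 2PLM; the mini item bank and function names are illustrative assumptions, not the PROMIS pain bank:

```python
import numpy as np

def prob_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_info_2pl(theta, a, b):
    p = prob_2pl(theta, a, b)
    return a**2 * p * (1.0 - p)

def next_item(theta_hat, a, b, administered):
    """Pick the not-yet-administered item with maximum information at theta_hat."""
    info = item_info_2pl(theta_hat, a, b)
    info[list(administered)] = -np.inf     # mask items already given
    return int(np.argmax(info))

# Illustrative (not from the slides) mini item bank
a = np.array([1.9, 1.1, 1.5, 0.8, 2.2])
b = np.array([-1.2, -0.4, 0.3, 1.0, 1.8])

administered = set()
theta_hat = 0.0                            # provisional trait estimate
for _ in range(3):
    j = next_item(theta_hat, a, b, administered)
    administered.add(j)
    # In a real CAT, the response to item j would be collected here and
    # theta_hat re-estimated (e.g., by EAP) before selecting the next item.
    print("administer item", j)
```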

CAT demonstration We'll walk through a static CAT demonstration here, but if you want to try it yourself visit: https://adaptest.vpgcentral.com Click "Adaptive IRT" in the "Example Features" section. This runs a real-time simulated adaptive test demonstration using the PROMIS pain bank.

CAT demonstration

More Advanced Topics There are many extensions of the simple models we've talked about today: multilevel models, multidimensional models, non-normal latent traits, and non- and semi-parametric item response functions. There have also been great strides in the past five years in the assessment of overall model fit.

A closing thought John Tukey once said: "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." (Tukey, 1962, pp. 13-14, emphasis in original)

Mike Edwards mce@vpgcentral.com Collaborate. Discover. Apply. www.vpgcentral.com

Vector Psychometric Group (VPG) Founded in 2007 as a psychometric consulting company. Added IRT software (flexMIRT) in 2011 and an adaptive testing platform (Adaptest) in 2009. Managing Partners: Li Cai, Mike Edwards, R.J. Wirth. VPG employs a team of full-time psychometricians and programmers as well as a network of consultants.