Because it might not make a big DIF: Assessing differential test functioning

Similar documents
Methods for the Comparison of DIF across Assessments W. Holmes Finch Maria Hernandez Finch Brian F. French David E. McIntosh

PIRLS 2016 Achievement Scaling Methodology 1

Comparison of parametric and nonparametric item response techniques in determining differential item functioning in polytomous scale

LINKING IN DEVELOPMENTAL SCALES. Michelle M. Langer. Chapel Hill 2006

A Study of Statistical Power and Type I Errors in Testing a Factor Analytic Model for Group Differences in Regression Intercepts

What's beyond Concerto: An introduction to the R package catR. Session 4: Overview of polytomous IRT models

An Equivalency Test for Model Fit. Craig S. Wells, University of Massachusetts Amherst. James A. Wollack. Ronald C. Serlin

Basic IRT Concepts, Models, and Assumptions

An Overview of Item Response Theory. Michael C. Edwards, PhD

Ability Metric Transformations

Psychology 282 Lecture #3 Outline

Comparison between conditional and marginal maximum likelihood for a class of item response models

Equating Tests Under The Nominal Response Model Frank B. Baker

Item Response Theory (IRT) Analysis of Item Sets

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Plausible Values for Latent Variables Using Mplus

Development and Calibration of an Item Response Model that Incorporates Response Time

Item Response Theory and Computerized Adaptive Testing

Summer School in Applied Psychometric Principles. Peterhouse College, 13th to 17th September 2010

Measurement Invariance (MI) in CFA and Differential Item Functioning (DIF) in IRT/IFA

Walkthrough for Illustrations. Illustration 1

Suppose we obtain a MLR equation as follows:

Overview. Multidimensional Item Response Theory. Lecture #12 ICPSR Item Response Theory Workshop. Basics of MIRT Assumptions Models Applications

On the Construction of Adjacent Categories Latent Trait Models from Binary Variables, Motivating Processes and the Interpretation of Parameters

A Marginal Maximum Likelihood Procedure for an IRT Model with Single-Peaked Response Functions

Chained Versus Post-Stratification Equating in a Linear Context: An Evaluation Using Empirical Data

Application of Plausible Values of Latent Variables to Analyzing BSI-18 Factors. Jichuan Wang, Ph.D

Lesson 7: Item response theory models (part 2)

Overview. Specific Examples. General Examples. Bivariate Regression & Correlation

What is an Ordinal Latent Trait Model?

Using modern statistical methodology for validating and reporting Outcomes

9/2/2010. Wildlife Management is a very quantitative field of study. throughout this course and throughout your career.

UCLA Department of Statistics Papers

Introduction To Confirmatory Factor Analysis and Item Response Theory

A multivariate multilevel model for the analysis of TIMMS & PIRLS data

Latent Trait Reliability

Detection of Uniform and Non-Uniform Differential Item Functioning by Item Focussed Trees

Basics of Modern Missing Data Analysis

Statistical and psychometric methods for measurement: Scale development and validation

Exploiting TIMSS and PIRLS combined data: multivariate multilevel modelling of student achievement

Advising on Research Methods: A consultant's companion. Herman J. Ader Gideon J. Mellenbergh with contributions by David J. Hand

The Discriminating Power of Items That Measure More Than One Dimension

General structural model Part 2: Categorical variables and beyond. Psychology 588: Covariance structure and factor models

Statistical and psychometric methods for measurement: G Theory, DIF, & Linking

The Impact of Item Position Change on Item Parameters and Common Equating Results under the 3PL Model

Application of Item Response Theory Models for Intensive Longitudinal Data

LSAC RESEARCH REPORT SERIES. Law School Admission Council Research Report March 2008

Identifying and accounting for outliers and extreme response patterns in latent variable modelling

Generalized Linear Models for Non-Normal Data

A Note on the Equivalence Between Observed and Expected Information Functions With Polytomous IRT Models

Georgia State University. Jihye Kim, Georgia State University

Don't be Fancy. Impute Your Dependent Variables!

Using Graphical Methods in Assessing Measurement Invariance in Inventory Data

Using DIF to Investigate Strengths and Weaknesses in Mathematics Achievement. Profiles of 10 Different Countries. Enis Dogan

Observed-Score "Equatings"

Scaling Methodology and Procedures for the TIMSS Mathematics and Science Scales

Bayesian Nonparametric Rasch Modeling: Methods and Software

An Empirical Comparison of Multiple Imputation Approaches for Treating Missing Data in Observational Studies

Contents. 3 Evaluating Manifest Monotonicity Using Bayes Factors: Introduction

Paul Barrett

A White Paper on Scaling PARCC Assessments: Some Considerations and a Synthetic Data Example

Measurement Invariance across Groups in Latent Growth Modeling

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY. Yu-Feng Chang

A NEW MODEL FOR THE FUSION OF MAXDIFF SCALING

Studies on the effect of violations of local independence on scale in Rasch models: The Dichotomous Rasch model

1. THE IDEA OF MEASUREMENT

Running head: LOGISTIC REGRESSION FOR DIF DETECTION. Regression Procedure for DIF Detection. Michael G. Jodoin and Mark J. Gierl

SAMPLING- Method of Psychology. By- Mrs Neelam Rathee, Dept of Psychology. PGGCG-11, Chandigarh.

Conditional Standard Errors of Measurement for Performance Ratings from Ordinary Least Squares Regression

A Goodness-of-Fit Measure for the Mokken Double Monotonicity Model that Takes into Account the Size of Deviations

Logistic Regression and Item Response Theory: Estimation Item and Ability Parameters by Using Logistic Regression in IRT.

Bayesian methods for missing data: part 1. Key Concepts. Nicky Best and Alexina Mason. Imperial College London

Memorandum. From: Erik Hale, Meteorology Group Assistant Manager Date: January 8, 2010 Re: NRG #40 Transfer Function Recommendation.

Introduction to Design of Experiments

Objectives Simple linear regression. Statistical model for linear regression. Estimating the regression parameters

Generalization to Multi-Class and Continuous Responses. STA Data Mining I

The Use of Quadratic Form Statistics of Residuals to Identify IRT Model Misfit in Marginal Subtables

Notes on Maxwell & Delaney

Notes 3: Statistical Inference: Sampling, Sampling Distributions Confidence Intervals, and Hypothesis Testing

Classical Test Theory (CTT) for Assessing Reliability and Validity

Contrasts and Correlations in Theory Assessment

arxiv: v1 [stat.ap] 11 Aug 2014

Some methods for handling missing values in outcome variables. Roderick J. Little

Association Between Variables Measured at the Ordinal Level

Bayesian Analysis of Latent Variable Models using Mplus

Dr. Junchao Xia, Center of Biophysics and Computational Biology. Fall 2016.

On the Use of Nonparametric ICC Estimation Techniques For Checking Parametric Model Fit

Educational and Psychological Measurement

Part 8: GLMs and Hierarchical LMs and GLMs

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

For additional copies write: ACT Research Report Series PO Box 168 Iowa City, Iowa by ACT, Inc. All rights reserved.

Comparison for alternative imputation methods for ordinal data

The Effect of Differential Item Functioning on Population Invariance of Item Response Theory True Score Equating

Likelihood and Fairness in Multidimensional Item Response Theory

TYPE I ERROR AND POWER OF THE MEAN AND COVARIANCE STRUCTURE CONFIRMATORY FACTOR ANALYSIS FOR DIFFERENTIAL ITEM FUNCTIONING DETECTION:

AC : GETTING MORE FROM YOUR DATA: APPLICATION OF ITEM RESPONSE THEORY TO THE STATISTICS CONCEPT INVENTORY

What Rasch did: the mathematical underpinnings of the Rasch model. Alex McKee, PhD. 9th Annual UK Rasch User Group Meeting, 20/03/2015

Anders Skrondal. Norwegian Institute of Public Health London School of Hygiene and Tropical Medicine. Based on joint work with Sophia Rabe-Hesketh

Transcription:

Because it might not make a big DIF: Assessing differential test functioning David B. Flora R. Philip Chalmers Alyssa Counsell Department of Psychology, Quantitative Methods Area

Differential item functioning. Differential item functioning (DIF) refers to any situation in which an item within a test measures an intended construct differently for one subgroup of a population than it does for another. In short, DIF is item bias. DIF is often assessed using item response theory (IRT).

Differential item functioning (DIF). DIF can be assessed for items fitted to any IRT model, including the three-parameter logistic (3PL) and graded response models (GRM). An item has DIF if two respondents from distinct groups who have equal levels of the latent trait (θ) do not have the same probability of endorsing response k for that item:

$$P^{1}(u_i = k \mid \theta) \neq P^{2}(u_i = k \mid \theta),$$

where the superscript indexes group membership. DIF is readily assessed using a multiple-group generalization of an item response model, e.g., for the two-parameter logistic (2PL) model:

$$P(u_i = 1 \mid \theta, g) = \frac{1}{1 + \exp[-a_i^{g}(\theta - b_i^{g})]}$$

Under this multiple-group 2PL, an item has no DIF when $a_i^{1} = a_i^{2}$ and $b_i^{1} = b_i^{2}$, uniform DIF when $a_i^{1} = a_i^{2}$ but $b_i^{1} \neq b_i^{2}$, and nonuniform DIF when $a_i^{1} \neq a_i^{2}$.
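As a concrete illustration, here is a minimal Python sketch (not the authors' code) of the group-specific 2PL response function, with hypothetical parameter values; DIF appears as unequal endorsement probabilities at the same θ:

```python
# Minimal sketch, assuming hypothetical item parameters: a multiple-group
# 2PL response function in which DIF is encoded as group-specific a and b.
import numpy as np

def p_2pl(theta, a, b):
    """P(u_i = 1 | theta) under the 2PL for one group's item parameters."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = 0.5                            # equal latent trait level in both groups
p_group1 = p_2pl(theta, a=1.2, b=0.0)  # hypothetical group 1 parameters
p_group2 = p_2pl(theta, a=1.2, b=0.4)  # same a, different b: uniform DIF
print(p_group1, p_group2)              # unequal probabilities -> the item has DIF
```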

Our questions: Of course DIF is a property of an individual item. To what extent does DIF lead to biased total scores, that is, differential test functioning (DTF)? Can we quantify DTF?

If a subset of items has DIF effects which consistently favor one group over another, then the overall impact of DIF on total test scores may be substantial. But DIF impact on total scores may be small when:
- individual DIF effects are small, especially if the test is long and only a small number of items show DIF;
- DIF effects in some items favor one group, but DIF in other items favors the other group.

Test scoring with IRT. First, represent the observed item response as a function of θ, given the estimated item parameters ψ_i (the item score function):

$$S_i(\theta, \psi_i) = \sum_{k=1}^{K} (k - 1)\, P(u_i = k \mid \theta, \psi_i)$$

From these item score functions, we can then represent the expected total score as a function of θ, given the set of all item parameters ψ:

$$T(\theta, \psi) = \sum_{i=1}^{I} S_i(\theta, \psi_i)$$

This total score function can be generalized to the multiple-group context:

$$T^{g}(\theta, \psi^{g}) = \sum_{i=1}^{I} S_i^{g}(\theta, \psi_i^{g})$$

Then we can visualize how individual DIF effects accumulate across items to create distinct total score functions, or DTF, across groups:
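The following Python sketch illustrates these score functions under a graded response model; grm_probs(), item_score(), total_score(), and all parameter values are hypothetical, not the authors' implementation:

```python
# Sketch of the item and total score functions, assuming a hypothetical
# graded response model (GRM) parameterization.
import numpy as np

def grm_probs(theta, a, thresholds):
    """Category probabilities P(u = k | theta), k = 1..K, via cumulative logits."""
    cum = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(thresholds))))
    upper = np.concatenate(([1.0], cum))   # P(u >= k) for k = 1..K
    lower = np.concatenate((cum, [0.0]))   # P(u >= k + 1)
    return upper - lower

def item_score(theta, a, thresholds):
    """S_i(theta) = sum_k (k - 1) P(u_i = k | theta)."""
    probs = grm_probs(theta, a, thresholds)
    return np.sum(np.arange(len(probs)) * probs)

def total_score(theta, items):
    """T(theta) = sum_i S_i(theta) for one group's item parameters."""
    return sum(item_score(theta, a, th) for a, th in items)

# two hypothetical 4-category items: (slope, [three ordered thresholds])
items_g1 = [(1.5, [-1.0, 0.0, 1.0]), (1.0, [-0.5, 0.5, 1.5])]
print(total_score(0.0, items_g1))  # expected total score at theta = 0
```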

[Figure: test characteristic curves by group, four panels: no DTF; DTF consistently favoring group 1; DTF favoring group 1 at high θ but group 2 at low θ; DTF mostly favoring group 2.]

Quantifying DTF. The signed and unsigned DTF statistics are:

$$\mathrm{sDTF} = \int \left[ T^{1}(\theta, \psi^{1}) - T^{2}(\theta, \psi^{2}) \right] g(\theta)\, d\theta$$

$$\mathrm{uDTF} = \int \left| T^{1}(\theta, \psi^{1}) - T^{2}(\theta, \psi^{2}) \right| g(\theta)\, d\theta$$

Here g(θ) is a weighting function that makes these measures averages rather than total areas. [The slide's plot depicts large uDTF but small sDTF.] If sDTF = 1, then group 1 total scores are one point higher than group 2 scores on average (despite equal θ). Estimates of uDTF and sDTF are obtained with numerical quadrature. These estimates are subject to sampling error in the item parameter estimates ψ̂:
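A quadrature sketch of these two integrals, assuming a standard normal weighting function g(θ) and reusing the hypothetical total_score() and items_g1 from the sketch above:

```python
# Approximate sDTF and uDTF on an evenly spaced grid, weighting by a
# standard normal g(theta); total_score() and items_g1 come from the
# hypothetical GRM sketch above.
import numpy as np
from scipy.stats import norm

def dtf_stats(items_g1, items_g2, n_quad=61, lim=6.0):
    theta = np.linspace(-lim, lim, n_quad)
    w = norm.pdf(theta)
    w /= w.sum()  # normalized weights make the integrals averages, not areas
    diff = np.array([total_score(t, items_g1) - total_score(t, items_g2)
                     for t in theta])
    sdtf = np.sum(diff * w)          # signed: opposite-direction DIF can cancel
    udtf = np.sum(np.abs(diff) * w)  # unsigned: magnitude regardless of sign
    return sdtf, udtf

# hypothetical group 2 parameters: DIF in the first item's thresholds
items_g2 = [(1.5, [-1.0, 0.2, 1.2]), (1.0, [-0.5, 0.5, 1.5])]
print(dtf_stats(items_g1, items_g2))
```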

$$\hat{\psi} \sim N(\psi, \Sigma(\hat{\psi})),$$

where Σ(ψ̂) is the inverse of the observed information matrix. The idea, based on Thissen & Wainer (1990): stochastically impute plausible values of the item parameters from the MVN distribution of ψ̂; calculate the sDTF and uDTF statistics for each imputed ψ̂; use the collection of imputed ψ̂s to form confidence envelopes for TCCs and confidence intervals for sDTF and uDTF.
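A sketch of that imputation loop, continuing the hypothetical code above; psi_hat, sigma_hat, and the unpack() helper are placeholders for output from an actual multiple-group IRT calibration, not a real library API:

```python
# Stochastic imputation after Thissen & Wainer (1990): draw plausible item
# parameter vectors from the MVN sampling distribution, recompute the DTF
# statistics per draw, and summarize with percentile intervals. psi_hat,
# sigma_hat, and unpack() are placeholders for calibration output.
import numpy as np

def dtf_confidence_intervals(psi_hat, sigma_hat, unpack,
                             n_imputations=1000, seed=1):
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(psi_hat, sigma_hat, size=n_imputations)
    # unpack() maps a flat parameter vector back to both groups' item lists
    stats = np.array([dtf_stats(*unpack(psi)) for psi in draws])
    lo, hi = np.percentile(stats, [2.5, 97.5], axis=0)
    return lo, hi  # 95% CIs for (sDTF, uDTF)
```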

95% confidence envelopes for two TCCs

Comparison with a previous approach. Our sDTF and uDTF offer several advantages over the DTF measure developed by Raju et al. (1995):
- There is no need to rely on stand-in latent trait estimates θ̂, which make Raju's DTF overly sensitive to sample characteristics and introduce additional uncertainty.
- The choice of reference vs. focal group remains arbitrary, but for Raju's DTF it affects prediction of the stand-in θ̂ estimates.
- sDTF and uDTF remain in the metric of expected test scores.

Simulation results
- When no DIF existed for any items, omnibus tests for sDTF and uDTF retained nominal or slightly conservative (for sDTF) Type I error rates.
- sDTF was more effective at detecting differential scoring when there was systematic DIF in intercepts, whereas uDTF was more effective when there was systematic DIF in slopes.
- When bidirectional DIF existed, the DIF effects cancelled each other out to produce small DTF statistics.
- Overall, the proposed DTF statistics demonstrated desirable properties that have not been obtained for previous DTF methods.

Example application. Responses from the General Self-Efficacy Scale were assessed for DIF and DTF across samples from Canada (n = 277) and Germany (n = 219): 10 items with a 4-point ordinal response scale, assessing the degree to which one generally views one's own actions as responsible for good outcomes, fitted with the graded response model. DIF is present in 6 of the 10 items; how does that manifest at the total score level?

Generating 1,000 imputations for the DTF statistics resulted in a significant sDTF (p = .002), but the effect was small:
sDTF = -0.629 (95% CI: -1.038 to -0.270)
uDTF = 0.663 (95% CI: 0.297 to 1.079)
Bias in the total scores was approximately 0.629 raw score points (or 1.57% of the maximum possible total score of 40) in favor of the German population.

At average to lower levels of θ, Canadians tend to score lower on the GSE. Shaded areas show 95% confidence envelopes.

Ignoring measurement bias, the mean total score was slightly higher among Germans (M = 30.54) than Canadians (M = 29.94), though not significantly so (p = .19). But IRT analyses showed that Germans had a lower latent mean (μ = -0.11) than Canadians (μ = 0, the reference group). This discrepancy occurred because of DTF that favored the German group overall (i.e., observed German total scores were higher than they should be). Yet this DTF effect is quite small, despite 6 of the 10 items showing DIF.

Conclusion. Differential item functioning does not necessarily produce badly biased test scores. sDTF and uDTF measure the extent to which DIF effects accrue across an entire test to produce biased test scores. Sampling variability of sDTF and uDTF can be effectively measured using a stochastic imputation procedure. Researchers can use our DTF measures to assess the extent to which measurement bias influences test scores overall, as well as within specific ranges of θ.

References
Chalmers, R. P., Counsell, A., & Flora, D. B. (in press). It might not make a big DIF: Improved statistics for differential test functioning. Educational and Psychological Measurement.
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353-368.
Thissen, D., & Wainer, H. (1990). Confidence envelopes for item response theory. Journal of Educational Statistics, 15, 113-128.