Methods for the Comparison of DIF across Assessments W. Holmes Finch Maria Hernandez Finch Brian F. French David E. McIntosh
2 Impact of DIF on Assessments Individuals administering assessments must select those with the greatest evidence of validity for a given population. A primary component of validity evidence is the fairness of assessments across subgroups. A major form of fairness evidence is differential item functioning (DIF) analysis.
3 Comparing levels of DIF across assessments Multiple instruments may exist for the same construct (e.g., intelligence, self-efficacy, learning styles, depression). Users want to select the assessment that exhibits the least amount of DIF for subgroups of interest (e.g., ethnicity, SES).
4 Comparing levels of DIF across assessments DIF methods are effective for flagging DIF, but DIF results do not yield easily comparable indices across assessments. Simple comparisons (e.g., the number of DIF items across measures) do not provide information about the magnitude of DIF across assessments.
5 Effect size estimates to compare levels of DIF Study goal: to examine several effect sizes for comparing the collective amount of DIF across two or more psychological assessments. Some of these effect sizes were developed expressly for this purpose; others already exist and are borrowed from the literature.
6 Effect Sizes We present 5 different effect sizes, drawn from different areas: traditional measures (e.g., d) and model-based approaches (e.g., item response theory).
7 Random effects IRT model-based DIF An IRT model that incorporates a random effect for group membership and any DIF associated with it: P(θ) = exp[1.7a_i(θ_j − b_i) + Gξ_i] / (1 + exp[1.7a_i(θ_j − b_i) + Gξ_i]), where a_i = discrimination parameter for item i, b_i = difficulty parameter for item i, θ_j = level on the latent trait for examinee j, G = group membership (focal group = 0, reference group = 1), and ξ_i = DIF effect for item i.
8 Random effects IRT model-based DIF When ξ_i = 0, no DIF is present for item i. The log odds ratio, log α_MHi, from the Mantel-Haenszel test of DIF is a good estimator of ξ_i (Camilli & Penfield, 1997). When there is no DIF for any of the items on a scale, the variance of ξ across all items, τ², equals 0.
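The item characteristic function above can be sketched numerically. The helper below is hypothetical (the function name and arguments are assumptions, not from the slides); it shows how the ξ_i term shifts the response probability for the reference group (G = 1) only.

```python
import math

def p_correct(theta, a, b, xi, group):
    """Probability of a correct response under the 2PL model with a DIF term.

    theta : latent trait level for the examinee
    a, b  : item discrimination and difficulty parameters
    xi    : DIF effect for the item (xi = 0 means no DIF)
    group : 0 = focal group, 1 = reference group
    """
    z = 1.7 * a * (theta - b) + group * xi
    return math.exp(z) / (1.0 + math.exp(z))

# With xi = 0, focal and reference examinees at the same theta have the
# same probability; with xi > 0 the reference group is advantaged.
print(p_correct(0.0, 1.2, 0.0, 0.0, 0))  # 0.5
print(p_correct(0.0, 1.2, 0.0, 0.5, 1))  # > 0.5: conditional advantage from DIF
```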
9 Random effects IRT model-based DIF The unweighted estimator is: τ² = [ Σ_{i=1}^{I} ( (log α_MHi − μ)² − S_i² ) ] / I, where μ = mean of log α_MHi across the I items of the measure and S_i² = variance of log α_MHi for each of the I items of the measure. The weighted estimator is: τ_w² = [ Σ_{i=1}^{I} w_i²(log α_MHi − μ)² − Σ_{i=1}^{I} w_i ] / Σ_{i=1}^{I} w_i², where w_i = 1/S_i².
10 Comparing DIF using the random effects IRT model-based DIF Calculate τ² and τ_w² for each assessment. These values reflect the total variance in the item responses associated with DIF. Calculate the difference in these values for each pair of assessments. Differences departing from 0 suggest varying levels of DIF in the assessments. Select the assessment with the least amount of collective DIF for the target variable(s) of interest (e.g., ethnicity, SES).
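The two estimators can be sketched in a few lines. This is a minimal sketch, assuming the slide's definitions: μ is the unweighted mean of the log odds ratios, and the precision weights are taken as w_i = 1/S_i² (an assumption consistent with the Camilli & Penfield formulation, since the extracted formula is ambiguous).

```python
import numpy as np

def tau_squared(log_or, se):
    """DIF variance estimates from per-item MH log odds ratios and their SEs.

    log_or : log(alpha_MHi) for each of the I items on a scale
    se     : S_i, the standard error of log(alpha_MHi) for each item
    Returns (tau2, tau2_w); negative sample estimates are typically
    truncated at 0 in practice.
    """
    log_or, se = np.asarray(log_or, float), np.asarray(se, float)
    I = len(log_or)
    mu = log_or.mean()                       # mean log odds ratio across items
    tau2 = (((log_or - mu) ** 2) - se ** 2).sum() / I
    w = 1.0 / se ** 2                        # assumed precision weights, w_i = 1/S_i^2
    tau2_w = ((w ** 2 * (log_or - mu) ** 2).sum() - w.sum()) / (w ** 2).sum()
    return tau2, tau2_w
```

To compare two scales, compute both estimates for each and prefer the scale with the smaller values.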
11 Cohen's d for DIF In the context of meta-analysis, the log odds ratio can easily be converted to Cohen's d (Hasselblad & Hedges, 1995). The log odds ratio obtained for each item on a scale using the MH test for DIF can be converted to Cohen's d as: d_i = (√3/π) · log α_MHi
12 Cohen's d for comparing DIF across assessments Calculate d_i for each item, then calculate the mean across items to obtain the scale-level d. Using the absolute values of d_i gives the unsigned effect size, d_u; using the signed values of d_i gives the signed effect size, d_s.
13 Cohen's d The mean d reflects the average conditional difference in the likelihood of a correct response between the groups of interest. It is on a commonly used and well-understood scale, leading to easy interpretation. To ascertain the relative amount of DIF in the scales: calculate the mean d for each instrument, compare values to determine which has the least amount of DIF, and make your assessment selection.
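The conversion and averaging steps above can be sketched as follows (function names are hypothetical illustrations, not from the slides):

```python
import math

def mh_to_d(log_or):
    """Hasselblad-Hedges conversion of one MH log odds ratio to Cohen's d."""
    return log_or * math.sqrt(3) / math.pi

def mean_d(log_ors, signed=False):
    """Scale-level average of item-level d values: d_s if signed, d_u otherwise."""
    ds = [mh_to_d(x) if signed else abs(mh_to_d(x)) for x in log_ors]
    return sum(ds) / len(ds)
```

Note that with DIF in opposite directions across items, d_s can be near zero while d_u is large, so the two convey different information about a scale.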
14 Logistic regression R²Δ for comparing DIF across assessments R²Δ is the change in the variance accounted for in the logistic regression model for DIF detection when adding group membership (uniform DIF) and the interaction of group and total score (nonuniform DIF). We propose averaging R²Δ across the items to obtain a mean R²Δ, a measure of overall DIF in a set of items.
15 Logistic regression R²Δ for comparing DIF across assessments The R²Δ value shares with Cohen's d the advantage of being on a well-known and easily interpretable scale. For comparing the collective DIF across assessments: calculate the mean R²Δ for each assessment, then compare and make your assessment selection.
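A per-item R²Δ computation can be sketched as below. This is a sketch under two assumptions not stated in the slides: Nagelkerke's R² is used as the variance-accounted-for measure (a common choice in this literature), and the logistic models are fit with a plain Newton-Raphson routine rather than a packaged estimator.

```python
import numpy as np

def logit_loglik(X, y, max_iter=25):
    """Fit a logistic regression by Newton-Raphson; return the maximized log-likelihood."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        hess = X.T @ (X * (p * (1 - p))[:, None]) + 1e-10 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, X.T @ (y - p))
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def item_delta_r2(item, total, group):
    """Nagelkerke R^2 gain when group and group-x-total terms join the matching model."""
    n = len(item)
    ll_null = logit_loglik(np.empty((n, 0)), item)           # intercept only
    ll_base = logit_loglik(np.column_stack([total]), item)   # matching variable only
    ll_full = logit_loglik(np.column_stack([total, group, total * group]), item)
    def nagelkerke(ll):
        cox_snell = 1.0 - np.exp(2.0 * (ll_null - ll) / n)
        return cox_snell / (1.0 - np.exp(2.0 * ll_null / n))
    return nagelkerke(ll_full) - nagelkerke(ll_base)
```

Averaging `item_delta_r2` over the items of a scale gives the proposed overall index for that scale.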
16 DIF effect size: Steinberg and Thissen The difference in item parameter estimates for two groups can serve as a measure of DIF effect size (Steinberg & Thissen, 2006). This effect size is intuitive and provides easily interpretable results, particularly for measurement professionals. For uniform DIF, the difference in item difficulty parameters serves as the effect size of interest.
17 DIF effect size: Steinberg and Thissen For each item on each assessment, the difference between the reference and focal groups' item difficulty values is calculated. The mean of these differences is then computed. The scale with the lower mean difference in difficulties is considered to have the least amount of DIF. Make your assessment selection.
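A minimal sketch of the mean-difference summary above. Taking absolute differences is an assumption added here (the slides do not specify signed versus unsigned), so that DIF in opposite directions across items does not cancel out.

```python
def steinberg_thissen(b_ref, b_focal):
    """Mean absolute difference in item difficulty (b) estimates between groups.

    b_ref, b_focal : per-item difficulty estimates for the reference and
    focal groups, assumed to be on a common metric. Absolute values are an
    assumption so opposite-direction DIF does not cancel.
    """
    return sum(abs(r - f) for r, f in zip(b_ref, b_focal)) / len(b_ref)
```

Comparing this summary across scales identifies the scale whose difficulties differ least between groups.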
18 SIBTEST SIBTEST is an effective tool for assessing both DIF and differential bundle functioning (DBF). The SIBTEST statistic for uniform DBF in a set of items is calculated as: β_Bundle = Σ_{k=0}^{K} p_k (Ȳ_Rk − Ȳ_Fk), where p_k = proportion of individuals with matching subtest score k, Ȳ_Rk = adjusted mean score for the reference group on the bundle for individuals with matching subtest score k, and Ȳ_Fk = adjusted mean score for the focal group on the bundle for individuals with matching subtest score k.
19 SIBTEST to compare DIF across assessments β_Bundle is a measure of the difference in conditional weighted mean performance between two groups on the items in a bundle. This statistic is calculated for each measure; the one with the lower value is determined to contain the least amount of overall DIF. Make your assessment selection.
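The weighting in β_Bundle can be sketched as below. Note this simplified version uses raw conditional means, whereas operational SIBTEST regression-corrects (adjusts) the conditional means before differencing; the function name and arguments are hypothetical.

```python
import numpy as np

def beta_bundle(match, bundle, group, K):
    """Simplified SIBTEST-style beta for an item bundle.

    match  : matching subtest score per examinee (integers 0..K)
    bundle : score on the studied bundle per examinee
    group  : 1 = reference group, 0 = focal group
    Raw conditional means stand in for SIBTEST's adjusted means.
    """
    match, bundle, group = map(np.asarray, (match, bundle, group))
    beta = 0.0
    for k in range(K + 1):
        at_k = match == k
        ref = at_k & (group == 1)
        foc = at_k & (group == 0)
        if ref.any() and foc.any():
            p_k = at_k.mean()                  # proportion with matching score k
            beta += p_k * (bundle[ref].mean() - bundle[foc].mean())
    return beta
```

Positive values favor the reference group conditionally; a value near 0 indicates little uniform DBF in the bundle.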
20 Summary of 5 effect size measures for differences in DIF
Difference in τ²: Reflects the difference in variance in the scales associated with DIF.
Difference in d: Reflects the difference in the average conditional likelihood of a correct response across items on the assessments.
Difference in R²Δ: Reflects the difference in the average conditional proportion of variance in the item responses associated with group membership.
Difference in S-T: Reflects the difference in the mean item difficulty differences between the two groups for whom DIF is assessed.
Difference in β_Bundle: Reflects the difference in the conditional difference in the groups' scores on the assessments.
21 Simulation Study Results: Which one works? When assessments are of the same length, the methods are all equally able to detect which assessment contains more DIF. When assessments have different numbers of items, d and R²Δ are overly likely to indicate that the shorter assessment contains more DIF. We concluded that τ², τ_w², or β_Bundle should be used to make comparisons regarding the amount of DIF in two or more scales; these were the most accurate across a wide variety of conditions.
22 The current study: purpose Compare the amount of DIF on three separate assessments of intelligence that are commonly used by school psychologists. Participants were each given all 3 measures. Evaluate the various effect sizes with real data. Of particular interest was DIF associated with mother's education level.
23 Method Sample: 200 preschool children (103 females). Age: 4 years 0 months to 5 years 11 months, with a mean (standard deviation) age of months (5.38). 62% (124) were Caucasian 32% had mothers with a high school education or less.
24 Method The children were administered: Woodcock Johnson-III cognitive assessment battery (WJ-III) Kaufman Assessment Battery for Children-Second Edition (KABC-II) Stanford-Binet Intelligence Scales, Fifth Edition (SBV). All children received all measures: Counterbalancing of administration controlled for order effects.
25 Method Grouping variable: mother's education level. Group 1: high school or less; Group 2: more than high school. Each effect size described previously was calculated for the first 7 items on each subtest of each assessment. The focus was on these 7 items because they are typically administered to all examinees, whereas later items are only administered to higher performing or older individuals. Subtests were matched into Cattell-Horn-Carroll (CHC) theory based constructs for comparison purposes.
26 Recommendations for interpreting DIF effect size measures
Cohen's d: Small (0.2-0.5), Medium (0.5-0.8), Large (0.8+)
τ²: Small (0-0.07), Medium (0.07-0.14), Large (0.14+)
R²Δ: Negligible (0-0.035), Moderate (0.035-0.07), Large (0.07+)
SIBTEST: Negligible (0-0.059), Moderate (0.059-0.088), Large (0.088+)
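These interpretation bands can be wrapped in a small lookup helper for labeling many subtest-level values at once. The helper, its name, and its keys are hypothetical conveniences, not from the slides; it classifies on absolute value so negative estimates are handled.

```python
def dif_label(stat, value):
    """Classify a DIF effect size into small/negligible, medium/moderate, or large."""
    bands = {                      # (medium/moderate cutoff, large cutoff)
        "d": (0.5, 0.8),
        "tau2": (0.07, 0.14),
        "r2_delta": (0.035, 0.07),
        "sibtest": (0.059, 0.088),
    }
    medium, large = bands[stat]
    v = abs(value)
    if v >= large:
        return "large"
    if v >= medium:
        return "medium/moderate"
    return "small/negligible"
```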
27 Results: Fluid Intelligence (*least amount of DIF for an index)
Test (Items) | d_u | d_s | R²Δ | S-T | τ² | τ_w² | β
SB Nonverbal Quant Reason (30): *
SB Verbal Quant Reason (30): 0.19*, 0.19*, 0.01*
SB Verbal Fluid Reason (13)
KABC Pattern Reason (23): *, *, 0.13
WJ Concept Formation (40): *
WJ Analysis Synthesis (35): *, *
28 Results: Crystallized Intelligence (*least amount of DIF for an index)
Test (Items) | d_u | d_s | R²Δ | S-T | τ² | τ_w² | β
SB Nonverb Know (30): *
KABC Verb Know (90): *
KABC Riddle (51)
WJ Verbal Comp A (23): 0.08*, 0.08*, *
WJ Verbal Comp B (23): *, *
WJ Verbal Comp C (15): *
WJ Verbal Comp D (18): *
WJ General Info (26): *, *, -0.07
29 Results: Short Term Memory (*least amount of DIF for an index)
Test (Items) | d_u | d_s | R²Δ | S-T | τ² | τ_w² | β
SB Nonverb Work Memory (34): *, -0.02*
SB Verbal Work Memory (15)
KABC Number Recall (22): *
KABC Word Order (27): *, *, 0.10
WJ Numbers Reversed (30): *, *
WJ Memory for Words (24): 0.27*, 0.27*, 0.01*
30 Results: Visual Processing (*least amount of DIF for an index)
Test (Items) | d_u | d_s | R²Δ | S-T | τ² | τ_w² | β
SB Verbal Visual Spatial (30): 0.15*, 0.15*, 0.01*
SB Nonverbal Visual Spatial (22): *, *
KABC Triangles (25): *
KABC Concept Thinking (28): *, *
KABC Face Recognition (21): *, 0.18
WJ Spatial Relations (33): *
WJ Picture Recognition (24)
31 Results: Auditory Processing (*least amount of DIF for an index)
Test (Items) | d_u | d_s | R²Δ | S-T | τ² | τ_w² | β
WJ Sound Blending (33): 0.13*, 0.13*, 0.004*, 0.06*, *
WJ Auditory Attention (50): *, 0.01*, -0.05
32 Results: Processing Speed (*least amount of DIF for an index)
Test (Items) | d_u | d_s | R²Δ | S-T | τ² | τ_w² | β
WJ Visual Matching 1 (26): *, *
WJ Visual Matching 2 (26): *
WJ Decision Speed (40): 0.27*, 0.27*, 0.01*, 0.37*, *, -0.05*
33 Conclusions The 5 effect sizes were useful in identifying specific subtests within each CHC factor that displayed the least amount of DIF with respect to mother's education. School psychologists, and others, can use this information to select the instrument that will provide the least biased assessment for a target population. These results, in combination with the simulation results, support the use of these effect sizes.
34 Conclusions The impact of DIF can be conceptualized as group differences in conditional item difficulty, conditional probability of a correct response, conditional performance on the set of items as a whole, or variance in the item responses associated with DIF. The specific results presented here revealed that the KABC-II had either the lowest or next-to-lowest amount of DIF for those CHC domains in which it had tests. With regard to τ_w², the KABC-II was generally the preferred measure, in terms of the amount of DIF, for this age group when parental education level is the grouping of concern. The SBV tended to exhibit the most DIF across CHC domains. Recommendation: use the KABC-II as the base test and supplement with the WJ-III in a cross-battery assessment strategy.
35 Future Directions Investigate the indices under various conditions and for different types of DIF (e.g., nonuniform DIF). Make these indices easy to obtain in the software used for DIF/DBF analysis. A standing challenge is making new, helpful indices easy for practitioners to obtain and understand.
More informationOn the Construction of Adjacent Categories Latent Trait Models from Binary Variables, Motivating Processes and the Interpretation of Parameters
Gerhard Tutz On the Construction of Adjacent Categories Latent Trait Models from Binary Variables, Motivating Processes and the Interpretation of Parameters Technical Report Number 218, 2018 Department
More informationStatistical and psychometric methods for measurement: Scale development and validation
Statistical and psychometric methods for measurement: Scale development and validation Andrew Ho, Harvard Graduate School of Education The World Bank, Psychometrics Mini Course Washington, DC. June 11,
More informationComparison between conditional and marginal maximum likelihood for a class of item response models
(1/24) Comparison between conditional and marginal maximum likelihood for a class of item response models Francesco Bartolucci, University of Perugia (IT) Silvia Bacci, University of Perugia (IT) Claudia
More informationThe Factor Analytic Method for Item Calibration under Item Response Theory: A Comparison Study Using Simulated Data
Int. Statistical Inst.: Proc. 58th World Statistical Congress, 20, Dublin (Session CPS008) p.6049 The Factor Analytic Method for Item Calibration under Item Response Theory: A Comparison Study Using Simulated
More informationIRT Model Selection Methods for Polytomous Items
IRT Model Selection Methods for Polytomous Items Taehoon Kang University of Wisconsin-Madison Allan S. Cohen University of Georgia Hyun Jung Sung University of Wisconsin-Madison March 11, 2005 Running
More informationMixtures of Rasch Models
Mixtures of Rasch Models Hannah Frick, Friedrich Leisch, Achim Zeileis, Carolin Strobl http://www.uibk.ac.at/statistics/ Introduction Rasch model for measuring latent traits Model assumption: Item parameters
More informationSampling Distributions: Central Limit Theorem
Review for Exam 2 Sampling Distributions: Central Limit Theorem Conceptually, we can break up the theorem into three parts: 1. The mean (µ M ) of a population of sample means (M) is equal to the mean (µ)
More informationPrevious lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure)
Previous lecture Single variant association Use genome-wide SNPs to account for confounding (population substructure) Estimation of effect size and winner s curse Meta-Analysis Today s outline P-value
More informationConditional Standard Errors of Measurement for Performance Ratings from Ordinary Least Squares Regression
Conditional SEMs from OLS, 1 Conditional Standard Errors of Measurement for Performance Ratings from Ordinary Least Squares Regression Mark R. Raymond and Irina Grabovsky National Board of Medical Examiners
More informationIntroduction: MLE, MAP, Bayesian reasoning (28/8/13)
STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this
More informationROGER E. MILLSAP INVARIANCE IN MEASUREMENT AND PREDICTION REVISITED. Introduction
PSYCHOMETRIKA VOL. 72, NO. 4, 461 473 DECEMBER 2007 DOI: 10.1007/S11336-007-9039-7 INVARIANCE IN MEASUREMENT AND PREDICTION REVISITED ROGER E. MILLSAP ARIZONA STATE UNIVERSITY Borsboom (Psychometrika,
More informationarxiv: v1 [stat.ap] 11 Aug 2014
Noname manuscript No. (will be inserted by the editor) A multilevel finite mixture item response model to cluster examinees and schools Michela Gnaldi Silvia Bacci Francesco Bartolucci arxiv:1408.2319v1
More informationSummer School in Applied Psychometric Principles. Peterhouse College 13 th to 17 th September 2010
Summer School in Applied Psychometric Principles Peterhouse College 13 th to 17 th September 2010 1 Two- and three-parameter IRT models. Introducing models for polytomous data. Test information in IRT
More informationSCORING TESTS WITH DICHOTOMOUS AND POLYTOMOUS ITEMS CIGDEM ALAGOZ. (Under the Direction of Seock-Ho Kim) ABSTRACT
SCORING TESTS WITH DICHOTOMOUS AND POLYTOMOUS ITEMS by CIGDEM ALAGOZ (Under the Direction of Seock-Ho Kim) ABSTRACT This study applies item response theory methods to the tests combining multiple-choice
More informationAlternative Growth Goals for Students Attending Alternative Education Campuses
Alternative Growth Goals for Students Attending Alternative Education Campuses AN ANALYSIS OF NWEA S MAP ASSESSMENT: TECHNICAL REPORT Jody L. Ernst, Ph.D. Director of Research & Evaluation Colorado League
More informationExponential and Logarithmic Functions. Copyright Cengage Learning. All rights reserved.
3 Exponential and Logarithmic Functions Copyright Cengage Learning. All rights reserved. 3.2 Logarithmic Functions and Their Graphs Copyright Cengage Learning. All rights reserved. What You Should Learn
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationAC : GETTING MORE FROM YOUR DATA: APPLICATION OF ITEM RESPONSE THEORY TO THE STATISTICS CONCEPT INVENTORY
AC 2007-1783: GETTING MORE FROM YOUR DATA: APPLICATION OF ITEM RESPONSE THEORY TO THE STATISTICS CONCEPT INVENTORY Kirk Allen, Purdue University Kirk Allen is a post-doctoral researcher in Purdue University's
More informationMeasurement Theory. Reliability. Error Sources. = XY r XX. r XY. r YY
Y -3 - -1 0 1 3 X Y -10-5 0 5 10 X Measurement Theory t & X 1 X X 3 X k Reliability e 1 e e 3 e k 1 The Big Picture Measurement error makes it difficult to identify the true patterns of relationships between
More informationAN INVESTIGATION OF THE ALIGNMENT METHOD FOR DETECTING MEASUREMENT NON- INVARIANCE ACROSS MANY GROUPS WITH DICHOTOMOUS INDICATORS
1 AN INVESTIGATION OF THE ALIGNMENT METHOD FOR DETECTING MEASUREMENT NON- INVARIANCE ACROSS MANY GROUPS WITH DICHOTOMOUS INDICATORS Jessica Flake, Erin Strauts, Betsy McCoach, Jane Rogers, Megan Welsh
More informationIn addition to the interactions reported in the main text, we separately
Experiment 3 In addition to the interactions reported in the main text, we separately examined effects of value on list 1 and effects of value on recall averaged across lists 2-8. For list 1, a 2 x 3 (item
More informationThe robustness of Rasch true score preequating to violations of model assumptions under equivalent and nonequivalent populations
University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School 2008 The robustness of Rasch true score preequating to violations of model assumptions under equivalent and
More informationExploring geographic knowledge through mapping
Prairie Perspectives 89 Exploring geographic knowledge through mapping Scott Bell, University of Saskatchewan Abstract: Knowledge about the world is expressed in many ways. Sketch mapping has been a dominant
More informationChained Versus Post-Stratification Equating in a Linear Context: An Evaluation Using Empirical Data
Research Report Chained Versus Post-Stratification Equating in a Linear Context: An Evaluation Using Empirical Data Gautam Puhan February 2 ETS RR--6 Listening. Learning. Leading. Chained Versus Post-Stratification
More informationThe t-statistic. Student s t Test
The t-statistic 1 Student s t Test When the population standard deviation is not known, you cannot use a z score hypothesis test Use Student s t test instead Student s t, or t test is, conceptually, very
More informationA Random Effects Model for Effect Sizes
Psychological Bulletin 1983, Vol. 93, No., 3 8-395 Copyright 1983 by the American Psychological Association, Inc. 0033-909/83/930-0388S00.75 A Random Effects Model for Effect Sizes Larry V. Hedges Department
More information