Likelihood and Fairness in Multidimensional Item Response Theory

Similar documents
Paradoxical Results in Multidimensional Item Response Theory

Overview. Multidimensional Item Response Theory. Lecture #12 ICPSR Item Response Theory Workshop. Basics of MIRT Assumptions Models Applications

Lesson 7: Item response theory models (part 2)

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Stat 5101 Lecture Notes

One-parameter models

Theory of Maximum Likelihood Estimation. Konstantin Kashin

Conditional probabilities and graphical models

36-720: The Rasch Model

FREQUENTIST BEHAVIOR OF FORMAL BAYESIAN INFERENCE

Computerized Adaptive Testing With Equated Number-Correct Scoring

2012 Assessment Report. Mathematics with Calculus Level 3 Statistics and Modelling Level 3

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Online Question Asking Algorithms For Measuring Skill

Algorithm-Independent Learning Issues

Topic 12 Overview of Estimation

1 Motivation for Instrumental Variable (IV) Regression

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

CS281 Section 4: Factor Analysis and PCA

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

Qualifying Exam in Machine Learning

Physics 509: Propagating Systematic Uncertainties. Scott Oser Lecture #12

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

JORIS MULDER AND WIM J. VAN DER LINDEN

6.867 Machine Learning

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

COMPUTATIONAL MULTI-POINT BME AND BME CONFIDENCE SETS

Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts III-IV

COS513 LECTURE 8 STATISTICAL CONCEPTS

Statistics 3858 : Maximum Likelihood Estimators

Mathematical Statistics

Detecting Exposed Test Items in Computer-Based Testing 1,2. Ning Han and Ronald Hambleton University of Massachusetts at Amherst

Likelihood and p-value functions in the composite likelihood context

Choosing among models

Basic IRT Concepts, Models, and Assumptions

Probability and Estimation. Alan Moses

Teacher Support Materials. Maths GCE. Paper Reference MPC4

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Probability and Statistics

Math 308 Midterm Answers and Comments July 18, Part A. Short answer questions

Solve Systems of Equations Algebraically

Statistics. Lecture 4 August 9, 2000 Frank Porter Caltech. 1. The Fundamentals; Point Estimation. 2. Maximum Likelihood, Least Squares and All That

Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models

Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models

CSC 411: Lecture 09: Naive Bayes

The Surprising Conditional Adventures of the Bootstrap

Time: 1 hour 30 minutes

Statistical Models with Uncertain Error Parameters (G. Cowan, arxiv: )

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

PhysicsAndMathsTutor.com

Introduction to Maximum Likelihood Estimation

Machine Learning for OR & FE

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

POLI 8501 Introduction to Maximum Likelihood Estimation

Support Vector Machines

Time: 1 hour 30 minutes

Definition 3.1 A statistical hypothesis is a statement about the unknown values of the parameters of the population distribution.

(1) Introduction to Bayesian statistics

Bridget Mulvey. A lesson on naming chemical compounds. The Science Teacher

Bootstrap and Parametric Inference: Successes and Challenges

Summer School in Applied Psychometric Principles. Peterhouse College 13 th to 17 th September 2010

What s the Average? Stu Schwartz

HOMEWORK #4: LOGISTIC REGRESSION

MS&E 226: Small Data

MACHINE LEARNING INTRODUCTION: STRING CLASSIFICATION

A Study of Statistical Power and Type I Errors in Testing a Factor Analytic. Model for Group Differences in Regression Intercepts

Fitting a Straight Line to Data

Bayesian Analysis of Latent Variable Models using Mplus

Lecture 10: Generalized likelihood ratio test

Multivariate Statistics

UNIVERSITY OF TORONTO Faculty of Arts and Science

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

1 Using standard errors when comparing estimated values

1 Mixed effect models and longitudinal data analysis

Lecture 2 Machine Learning Review

Part 6: Multivariate Normal and Linear Models

Model comparison and selection

Measuring nuisance parameter effects in Bayesian inference

Student Performance Q&A: 2001 AP Calculus Free-Response Questions

Probabilistic Machine Learning. Industrial AI Lab.

C A R I B B E A N E X A M I N A T I O N S C O U N C I L REPORT ON CANDIDATES WORK IN THE CARIBBEAN SECONDARY EDUCATION CERTIFICATE EXAMINATION

An Overview of Item Response Theory. Michael C. Edwards, PhD

Examiner s Report Pure Mathematics

Bayesian Learning (II)

2006 Mathematical Methods (CAS) GA 3: Examination 2

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure)

Online Item Calibration for Q-matrix in CD-CAT

Bayesian inference. Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark. April 10, 2017

The number of marks is given in brackets [ ] at the end of each question or part question. The total number of marks for this paper is 72.

18.05 Practice Final Exam

Chapter 1 Statistical Inference

Principles of Bayesian Inference

Algebra Exam. Solutions and Grading Guide

Approximate Likelihoods

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

Notes on Markov Networks

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs)

Questions 3.83, 6.11, 6.12, 6.17, 6.25, 6.29, 6.33, 6.35, 6.50, 6.51, 6.53, 6.55, 6.59, 6.60, 6.65, 6.69, 6.70, 6.77, 6.79, 6.89, 6.

Transcription:

Likelihood and Fairness in Multidimensional Item Response Theory (or: What I Thought About On My Holidays)
Giles Hooker and Matthew Finkelman, Cornell University, February 27, 2008

Item Response Theory: Educational Testing

The traditional model of a test: a test comprises N items (questions). A pool of n subjects (students) each answers every item. Answers are labeled correct (1) or incorrect (0), so each student has a response vector $y = (y_1, \ldots, y_N)$. A score $S(y)$ is then assigned to each subject. Classically:
$$S(y) = \sum_{i=1}^N \alpha_i y_i$$

Item Response Theory

A more sophisticated approach: each subject has an ability $\theta \in \mathbb{R}$, and the purpose of testing is to elicit information about $\theta$. Model the probability of a correct response to item i by $P_i(\theta)$ (called the response function), and perform statistical inference to estimate $\theta$. Some reasons for doing this:
- Measures of confidence about ability estimates.
- Measures of test quality.
- Early stopping.
- Adaptive testing procedures.

Item Response Theory: Models and Methods

Item response models:
- 2-parameter logistic model, with $a_{i1} > 0$:
$$P_i(\theta) = \frac{1}{1 + \exp(-a_{i1}(\theta - a_{i0}))}$$
- A subset of ogive models: $P_i(\theta) = F(a_{i1}(\theta - a_{i0}))$.

Estimates for $\theta$ (essentially a GLM with fixed intercept):
- Maximum Likelihood: $\hat\theta_{\mathrm{MLE}} = \mathrm{argmax}_\theta\, l_N(\theta; y)$
- Maximum A-Posteriori: $\hat\theta_{\mathrm{MAP}} = \mathrm{argmax}_\theta\, f_N(\theta \mid y)$
- Expected A-Posteriori: $\hat\theta_{\mathrm{EAP}} = \int \theta\, f_N(\theta \mid y)\, d\theta$

(A small numerical sketch of the three estimators follows.)
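A minimal sketch of the three estimators for a short unidimensional 2PL test; the item parameters, responses, standard normal prior, and search bounds are all invented for the illustration.

```python
# A minimal sketch: MLE, MAP, and EAP for a short 2PL test.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

a1 = np.array([1.2, 0.8, 1.5])   # discriminations a_i1 > 0 (invented)
a0 = np.array([-0.5, 0.3, 1.0])  # difficulties a_i0 (invented)
y = np.array([1, 1, 0])          # observed responses

def log_lik(theta):
    p = 1.0 / (1.0 + np.exp(-a1 * (theta - a0)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# MLE: maximize l(theta; y) over a bounded search interval
theta_mle = minimize_scalar(lambda t: -log_lik(t),
                            bounds=(-6, 6), method="bounded").x

# MAP: add the log of a standard normal prior before maximizing
theta_map = minimize_scalar(lambda t: -(log_lik(t) + norm.logpdf(t)),
                            bounds=(-6, 6), method="bounded").x

# EAP: posterior mean by quadrature on a grid
grid = np.linspace(-6, 6, 601)
post = np.exp([log_lik(t) for t in grid]) * norm.pdf(grid)
theta_eap = np.trapz(grid * post, grid) / np.trapz(post, grid)

print(theta_mle, theta_map, theta_eap)
```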

Multidimensional IRT

There is more than one type of ability (e.g. language vs. analysis vs. computation), so describe ability by $\theta \in \mathbb{R}^d$. MIRT models become:
- Compensatory: $P_i(\theta) = F(a_{i0} + \theta^T a_{i1})$
- Non-compensatory: $P_i(\theta) = \prod_{j=1}^d F(a_{ij0} + \theta_j a_{ij1})$

Each $a_{ij1} \geq 0$ ensures that $P_i$ is increasing in each $\theta_j$. Choosing F to be log-concave makes optimization easy.
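A minimal sketch contrasting the two model families, assuming a logistic ogive F; the loadings and abilities below are illustrative only.

```python
# Compensatory vs. non-compensatory MIRT response functions.
import numpy as np

def F(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_compensatory(theta, a0, a1):
    # P_i(theta) = F(a_i0 + theta' a_i1): a deficit in one ability
    # can be offset by a surplus in another.
    return F(a0 + theta @ a1)

def p_noncompensatory(theta, a0, a1):
    # P_i(theta) = prod_j F(a_ij0 + theta_j a_ij1): the item is likely
    # failed unless every component skill is adequate.
    return np.prod(F(a0 + theta * a1))

theta = np.array([1.0, -1.0])  # strong on dimension 1, weak on dimension 2
print(p_compensatory(theta, 0.0, np.array([1.0, 1.0])))    # ~0.5: abilities offset
print(p_noncompensatory(theta, np.zeros(2), np.ones(2)))   # ~0.2: weak dimension drags down
```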

Multidimensional IRT: Aside on Identifiability

In logistic regression, models are unidentifiable if the design matrix is singular or the answer sequence makes the classes linearly separable. In MIRT the intercepts are fixed, so separability is not a problem. However, since $a_{ij1} > 0$, the model is unidentifiable for homogeneous response sequences:
- Compensatory models: if $y = \mathbf{1}$, the only confidence interval for $\theta_1$ is $(-\infty, \infty]$; if $y = \mathbf{0}$, it is $[-\infty, \infty)$.
- Non-compensatory models: if $y = \mathbf{0}$, the only confidence interval for $\theta_1$ is $[-\infty, \infty]$.

This only applies to likelihood-based estimates. But even frequentists can use Bayesian methods in IRT!

Multidimensional IRT: Possible Scoring Rules

Base scores on the estimate $\hat\theta$:
- Nuisance parameters: $S(y) = \hat\theta_1$. Want to assess analysis skills while controlling for language (Veldkamp and van der Linden, 2002).
- Multiple hurdle: $S(y) = I(\hat\theta_1 > T_1, \hat\theta_2 > T_2)$. Select only candidates with both high analysis and high language skills (Segall, 2000).
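A toy illustration of the two rules, with hypothetical thresholds T1 and T2; theta_hat stands in for whichever estimate the test reports.

```python
# The two scoring rules above, as functions of an ability estimate.
import numpy as np

def score_nuisance(theta_hat):
    return theta_hat[0]              # report ability 1, control for ability 2

def score_multiple_hurdle(theta_hat, T1=0.0, T2=0.0):
    return int(theta_hat[0] > T1 and theta_hat[1] > T2)   # must pass both hurdles

print(score_nuisance(np.array([0.4, -1.2])))          # 0.4
print(score_multiple_hurdle(np.array([0.4, -1.2])))   # 0: fails the second hurdle
```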

A Story

Jane and Jill are high school friends taking a college entrance exam. Comparing notes afterwards, they find they gave the same answers up until the final question: Jill answered the last question correctly, Jane answered it incorrectly. When the results come back, Jane passed but Jill failed!

The college's position: these are legitimate results. The exam questions required both language and analysis (a MIRT analysis), and candidates are admitted only if they excel at both. This implies a multiple hurdle model, based on MLE estimates of $\theta$.

A Story: An Explanation

Jill and Jane each got some questions right and some wrong. The final question heavily emphasized analysis. Jill answered it correctly, indicating strong analysis skills; but she answered others incorrectly, indicating poor language skills. Jane has weaker analysis skills, so she must have relied on language for her correct answers.

Jill's lawyer: Is it really reasonable to put students in the position of second-guessing when their best answer might be harmful to them?

Empirical Non-monotonicity: Empirical Study

Call $S(y)$ non-monotone if there are answer sequences $y^0 \leq y^1$ with $S(y^0) > S(y^1)$. In a real-world data set:
- A 67-question English test taken by 7500 grade 5 students.
- Responses modeled as measuring listening and reading/writing.
- Reading/writing skills are most emphasized in the test; treat this as the "ability of interest".
- Test parameters estimated from 5000 students; the remaining 2500 used to investigate non-monotonicity.
- $\hat\theta_1(y)$ taken to be the EAP with a standard bivariate normal prior.

Among the 2500 students, 4 pairs match the "Jane and Jill" story.

Empirical Non-monotonicity: Hypotheticals

"If only I had answered more questions wrong, I would have passed!" A greedy algorithm: for each student,
- Try changing each correct answer to incorrect.
- If at least one change increases $\hat\theta_1$, choose the item that leads to the largest increase.
- Repeat until $\hat\theta_1$ cannot be increased further, keeping track of the cumulative increase.

Paradoxical increases were found in the range [0, 0.251], corresponding to [0, 5.2]% of the range of estimated abilities. (A sketch of the search appears below.)
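A sketch of the greedy search, assuming a scoring function theta1_hat(y) is supplied (the study used the EAP for the ability of interest); the function name and bookkeeping are placeholders, not code from the study.

```python
# Greedy search for paradoxical score increases; theta1_hat is a
# placeholder for the scoring function (e.g. the EAP for the
# reading/writing dimension under a bivariate normal prior).
import numpy as np

def greedy_paradox(y, theta1_hat):
    y = np.array(y).copy()
    base = theta1_hat(y)
    total_gain = 0.0
    while True:
        best_gain, best_item = 0.0, None
        for i in np.where(y == 1)[0]:        # try flipping each correct answer
            y[i] = 0
            gain = theta1_hat(y) - base
            y[i] = 1
            if gain > best_gain:
                best_gain, best_item = gain, i
        if best_item is None:                # no flip raises the estimate
            return y, total_gain
        y[best_item] = 0                     # commit the most paradoxical flip
        base += best_gain
        total_gain += best_gain
```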

Empirical Non-monotonicity: Graphically

What proportion of students would move from fail ($\hat\theta_1 < \theta_T$) to pass ($\hat\theta_1 \geq \theta_T$) by getting more questions wrong?

[Figure: two panels showing the paradoxical increase and the resulting paradoxical classification.]

Mathematical Analysis

Mathematically, how does non-monotonicity occur, and how do we stop it? Possible variables: the MIRT models used for the response functions, the statistical estimates, and the test design.

Mathematical Analysis: An Intuition

Contours of the response functions are always decreasing, so contours of the likelihood look like negatively-oriented ellipses, and the MLE for $\theta_1$ conditional on $\theta_2$ is decreasing in $\theta_2$. If $\hat\theta_{2,\mathrm{MLE}}$ increases without changing the conditional MLE $\hat\theta_{1,\mathrm{MLE}}(\theta_2)$, then $\hat\theta_{1,\mathrm{MLE}}$ must decrease. This can be achieved with an item that is not affected by $\theta_1$.

Mathematical Analysis: A Simple Observation

For log-concave ogives,
$$\frac{\partial^2}{\partial\theta_i \partial\theta_j} l(\theta; y) = \frac{\partial^2}{\partial\theta_i \partial\theta_j} \sum_{k=1}^N \big[ y_k \log P_k(\theta) + (1 - y_k) \log(1 - P_k(\theta)) \big] < 0, \quad i \neq j,$$
for both compensatory and non-compensatory models. In particular,
$$\frac{\partial}{\partial\theta_1} l(\theta_1, \theta_2^+; y) < \frac{\partial}{\partial\theta_1} l(\theta_1, \theta_2^-; y) \quad \text{whenever } \theta_2^+ > \theta_2^-.$$
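For a concrete check in the compensatory logistic case, writing $a_{kj}$ for the loading of item k on dimension j (a condensed version of the notation above), the score and cross-derivative are
$$\frac{\partial l}{\partial\theta_j} = \sum_{k=1}^N \big(y_k - P_k(\theta)\big)\, a_{kj}, \qquad \frac{\partial^2 l}{\partial\theta_1 \partial\theta_2} = -\sum_{k=1}^N P_k(\theta)\big(1 - P_k(\theta)\big)\, a_{k1}\, a_{k2} \;<\; 0,$$
with strict inequality whenever some item loads positively on both dimensions.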

Mathematical Analysis: Formally

We restrict to two dimensions.

Theorem. For every test with bivariate compensatory or non-compensatory items based on log-concave ogives, and for every response sequence y such that $\hat\theta_{\mathrm{MLE}}(y)$ is well defined, a further item may be added to the test such that
$$\hat\theta_{1,\mathrm{MLE}}(y, 1) < \hat\theta_{1,\mathrm{MLE}}(y, 0).$$
That further item can be any item that does not depend on $\theta_1$.

Mathematical Analysis: Bounds on $\theta_1$

Basing a threshold on MLE values is silly; conduct a test instead! Define a profile likelihood lower bound by
$$\hat\theta_{1,\mathrm{PLB}}(y) = \min\left\{ \theta_1 : \max_{\theta_2} l(\theta_1, \theta_2; y) > l(\hat\theta_{\mathrm{MLE}}; y) - \chi^2_1(\alpha) \right\}.$$

Theorem. For every test with bivariate compensatory or non-compensatory items based on log-concave ogives, and for every response sequence y such that $\hat\theta_{\mathrm{MLE}}(y)$ is well defined, a further item may be added to the test such that
$$\hat\theta_{1,\mathrm{PLB}}(y, 1) < \hat\theta_{1,\mathrm{PLB}}(y, 0).$$
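A grid-search sketch of computing this bound, assuming a bivariate log-likelihood loglik(theta1, theta2) is supplied (a placeholder name); it uses the conventional $\chi^2_1$ cutoff with the factor of 1/2 that the slide absorbs into $\chi^2_1(\alpha)$.

```python
# Grid-search sketch of the profile likelihood lower bound.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

def profile_lower_bound(loglik, theta_mle, alpha=0.05, lo=-6.0, hi=6.0):
    # Conventional cutoff: l_max - chi2_1(1 - alpha)/2.
    cutoff = loglik(*theta_mle) - chi2.ppf(1 - alpha, df=1) / 2.0

    def profile(t1):                          # max over theta2 of l(t1, theta2)
        res = minimize_scalar(lambda t2: -loglik(t1, t2),
                              bounds=(lo, hi), method="bounded")
        return -res.fun

    grid = np.linspace(lo, hi, 241)
    inside = [t1 for t1 in grid if profile(t1) > cutoff]
    return min(inside) if inside else lo      # smallest theta1 not rejected
```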

Mathematical Analysis: Bayesian Methods

Consider an independence prior on $\theta$,
$$\mu(\theta_1, \theta_2) = \mu_1(\theta_1)\, \mu_2(\theta_2),$$
with $\mu_1$ and $\mu_2$ log-concave. The posterior $f(\theta \mid y)$ then has $\partial^2 \log f(\theta \mid y) / \partial\theta_i \partial\theta_j < 0$.

Theorem. For every test with bivariate compensatory or non-compensatory items based on log-concave ogives and a log-concave independence prior, and for every response sequence y, a further item may be added to the test such that
$$\hat\theta_{1,\mathrm{MAP}}(y, 1) < \hat\theta_{1,\mathrm{MAP}}(y, 0).$$

Mathematical Analysis: Marginal Inference

For univariate densities $f_1(x)$, $f_2(x)$ with cumulative distribution functions $F_1(x)$, $F_2(x)$,
$$\frac{d \log f_1}{dx} \geq \frac{d \log f_2}{dx} \;\Rightarrow\; F_1(x) \leq F_2(x).$$
As $\theta_2$ increases, $\theta_1 \mid \theta_2, y$ becomes stochastically smaller.

Theorem. For every test with bivariate compensatory or non-compensatory items based on log-concave ogives and a log-concave independence prior, and for every response sequence y, a further item may be added to the test such that
$$F(\theta_1 \mid (y, 1)) > F(\theta_1 \mid (y, 0)).$$

Compensatory Models

Can we restrict the items in a test so that non-monotone results can't occur? In compensatory models we have $P_i(\theta) = F_i(\theta^T a_i)$. Let A, with rows $[A]_i = a_i^T$, be the "design matrix" for the test.

Theorem. For any test employing compensatory response models, let y be any response sequence such that $\hat\theta_{\mathrm{MLE}}(y)$ exists and is unique, and let $F_N(\theta^T b)$ be a further item in the test. A sufficient condition for $\hat\theta_{1,\mathrm{MLE}}(y, 1) < \hat\theta_{1,\mathrm{MLE}}(y, 0)$ is
$$e_1^T (A^T W A)^{-1} b < 0$$
for all non-negative diagonal matrices W.

Compensatory Models: In Bivariate Models

What does that condition mean?
$$e_1^T (A^T W A)^{-1} b = C\left( \sum_i w_i a_{i2}^2\, b_1 - \sum_i w_i a_{i1} a_{i2}\, b_2 \right) = C \sum_i w_i a_{i2} \big( a_{i2} b_1 - a_{i1} b_2 \big)$$
with $C > 0$.

Corollary. Let a test comprised of bivariate compensatory items be ordered such that $a_{N2}/a_{N1} > a_{i2}/a_{i1}$ for $i = 1, \ldots, N-1$. Then for any response pattern $y = (y_1, \ldots, y_{N-1})$,
$$\hat\theta_{1,\mathrm{MLE}}(y, 0) > \hat\theta_{1,\mathrm{MLE}}(y, 1)$$
so long as these are well defined. (A numeric check appears below.)
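A numeric sanity check of the corollary with made-up loadings: when the new item has the steepest $a_2/a_1$ ratio, $e_1^T(A^T W A)^{-1} b$ comes out negative for every non-negative diagonal W tried.

```python
# Check the paradox condition e1' (A'WA)^{-1} b < 0 numerically.
import numpy as np

A = np.array([[1.0, 0.2],
              [0.8, 0.4],
              [1.1, 0.3]])          # existing items (rows a_i')
b = np.array([0.5, 1.5])            # new item: largest a2/a1 ratio of all

rng = np.random.default_rng(0)
for _ in range(5):
    W = np.diag(rng.uniform(0.05, 1.0, size=A.shape[0]))
    val = np.linalg.solve(A.T @ W @ A, b)[0]   # e1' (A'WA)^{-1} b
    print(round(val, 4))                       # negative in every trial
```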

Compensatory Models: Some Consequences

$\hat\theta_{1,\mathrm{MLE}}(y)$ exhibits non-monotonicity for every bivariate test using compensatory models. For almost all answer sequences, $\hat\theta_{1,\mathrm{MLE}}(y)$ can be made to increase by changing an answer from "correct" to "incorrect", or to decrease by changing an answer from "incorrect" to "correct", and there is a simple rule for which answer needs to be changed.

A student facing a test of their ability in mathematics which also includes language processing as a nuisance dimension may be well advised to find the question that appears to place least emphasis on mathematics and make certain to answer it incorrectly. But only if they can reliably identify this question!

Compensatory Models: The Story Continues

During the trial, it was discovered that Jane's uncle worked for the agency that designed the exam. He denied providing any improper assistance to his niece: "I knew she was concerned about the language component, but I only told her not to get worried at the start of the test, because we saved the toughest analysis question till last."

Nobody found the copy of Hooker and Finkelman (2008) in her computer's recycle bin.

Compensatory Models: The Analysis Continues

For Bayesian methods, suppose that $-\partial^2 \log \mu(\theta) / \partial\theta\, \partial\theta^T \preceq K$.

Theorem. For any test employing compensatory response models, let y be any response sequence. A sufficient condition for $\hat\theta_{1,\mathrm{MAP}}(y, 1) < \hat\theta_{1,\mathrm{MAP}}(y, 0)$ is
$$e_1^T (A^T W A + \tilde K)^{-1} b < 0$$
for all non-negative diagonal matrices W and all $0 \preceq \tilde K \preceq K$.

Compensatory Models: Bivariate Interpretation

Suppose $\mu$ is bivariate normal with variances $\sigma_1^2$, $\sigma_2^2$ and correlation $\rho$. A sufficient condition for $\hat\theta_{1,\mathrm{MAP}}(y, 1) < \hat\theta_{1,\mathrm{MAP}}(y, 0)$ is
$$\frac{b_2}{b_1} > \frac{\sum_{i=1}^N w_i a_{i2}^2 + \frac{1}{\sigma_2^2 (1 - \rho^2)}}{\sum_{i=1}^N w_i a_{i1} a_{i2} - \frac{\rho}{(1 - \rho^2)\sigma_1 \sigma_2}}, \qquad w_i > 0.$$
The same condition holds for the EAP. Even with $\rho = 0$ the condition is more stringent than in the frequentist case, and making $\sigma_2^2$ small amounts to constraining the items to vary only with $\theta_1$. Setting $\rho = 0.8$ did not entirely remove non-monotonicity in the real data. (This threshold is sketched below.)
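A small sketch of the prior-adjusted threshold on $b_2/b_1$, with illustrative loadings and weights; it shows how shrinking $\sigma_2^2$ raises the bar for a paradoxical item.

```python
# Threshold on b2/b1 from the bivariate-normal condition above;
# loadings and weights are illustrative placeholders.
import numpy as np

def b2_b1_threshold(a1, a2, w, sigma1, sigma2, rho):
    num = np.sum(w * a2**2) + 1.0 / (sigma2**2 * (1.0 - rho**2))
    den = np.sum(w * a1 * a2) - rho / ((1.0 - rho**2) * sigma1 * sigma2)
    return num / den      # meaningful as a bound only when den > 0

a1 = np.array([1.0, 0.8, 1.1])
a2 = np.array([0.2, 0.4, 0.3])
w = np.full(3, 0.25)

print(b2_b1_threshold(a1, a2, w, 1.0, 1.0, 0.0))   # ~5.0
print(b2_b1_threshold(a1, a2, w, 1.0, 0.3, 0.0))   # ~52.6: small sigma2 raises the bar
```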

Extensions: Discrete Ability Spaces

Ability variables are coded master/non-master in each dimension. One can construct numerical examples of non-monotonicity, but these models are more difficult to analyze explicitly.

Extensions: Non Log-Concave Ogives

It is common to use "guessing parameters", especially for multiple-choice items:
$$\tilde P_i(\theta) = c_i + (1 - c_i) P_i(\theta).$$
Second derivatives of the likelihood will then not always be negative, although the condition is only needed near the MLE or MAP. In the real data, with an independence prior, all subjects could have their EAP changed paradoxically.
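A quick numeric check that guessing destroys log-concavity: the second derivative of $\log(c + (1-c)F(u))$ turns positive for very negative u, so the sign condition used above can fail; the guessing parameter below is an invented example value.

```python
# Finite-difference check that log(c + (1-c)F(u)) is not concave.
import numpy as np

c = 0.2                                   # illustrative guessing parameter
F = lambda u: 1.0 / (1.0 + np.exp(-u))
u = np.linspace(-8.0, 8.0, 1601)
h = np.log(c + (1.0 - c) * F(u))
d2 = np.gradient(np.gradient(h, u), u)    # numerical second derivative
print(d2.max() > 0)                       # True: log-concavity fails
```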

Extensions: Three or More Ability Dimensions

For MLEs, if the final question does not depend on $\theta_1$, then
$$\hat\theta_{j,\mathrm{MLE}}(y, 0) \leq \hat\theta_{j,\mathrm{MLE}}(y, 1) \text{ for all } j \neq 1 \;\Rightarrow\; \hat\theta_{1,\mathrm{MLE}}(y, 0) > \hat\theta_{1,\mathrm{MLE}}(y, 1).$$
But the left-hand side is not guaranteed; if it does not hold, $\hat\theta_{\mathrm{MLE}}$ behaves non-monotonically in some other dimension. This is a problem for the multiple-hurdle model; it is less clear for nuisance dimensions.

Conclusions: Constrained Estimates

One solution: jointly estimate $\theta^{(j)}$ for each subject j, imposing the constraints $\hat\theta^{(j)} \leq \hat\theta^{(i)}$ whenever $y^{(j)} \leq y^{(i)}$.
- Feasible: about 2 minutes for the real-world data, computing $\hat\theta_{\mathrm{MAP}}$ with IPOPT routines.
- Moderate distortion: about the same as the difference between MAP and EAP.
- Consistent: asymptotically, $y^{(j)} \leq y^{(i)}$ doesn't happen.
- But this does not address "what if" questions, and more subjects cannot be added. Estimate for all possible responses under constraints? Infeasible. (Inconsistent?)

Conclusions

Jane and Jill both took a test
To get into a college;
Jill beat Jane, but also failed
Yet Jane might pass, we allege.

Non-monotonicity:
- has clear implications for test fairness.
- is sufficiently common in real-world data to be concerning.
- is theoretically inescapable in compensatory bivariate ability models using frequentist inference.
- is likely to occur in extensions of the models studied here.
- is unlikely to be eliminated by alternative statistical techniques without compromising properties such as consistency.

Perhaps we should re-think the goals of educational testing, at least when the student cares about the result.