Bayesian Methods for Testing Axioms of Measurement

George Karabatsos, University of Illinois-Chicago
University of Minnesota, Quantitative/Psychometric Methods Area, Department of Psychology
Friday, April 3, 2015
Supported by NSF-MMS Research Grants SES-0242030 and SES-1156372

Outline
I. Introduction: Axioms of Measurement.
   A. Relations of Axioms to IRT models.
   B. Rasch, 2PL, Monotone Homogeneity and Double-Monotone IRT models.
II. General Bayesian Model for Axiom Testing
   A. Model Estimation (MCMC).
   B. Axiom Testing Procedures
III. Empirical Illustrations of Bayesian Axiom Testing.
   a) Convict data (orig. analyzed by Perline, Wright & Wainer, 1979, APM).
   b) NAEP reading test data
IV. Dealing with axiom violations: A Bayesian Nonparametric outlier-robust IRT model with application to teacher preparation survey from PIRLS.
V. Extensions of the Bayesian axiom testing model.
VI. Conclusions

I. Introduction
IRT models aim to represent, via model parameters, persons (examinees) and items on ordinal or interval scales of measurement. In IRT practice, such measurement scales are assumed for the parameters. The ability to represent persons and items on ordinal or interval scales depends on the data satisfying a set of key cancellation axioms (Luce & Tukey, 1964, JMP). These axioms are deterministic, but we can restate them in probabilistic terms, as follows. We first briefly consider the deterministic case, to motivate the probabilistic approach.

I. (Deterministic) Axioms of Measurement

Rows: levels of the row variable (i = 0, ..., 6). Columns: levels of the column variable (j = 1, ..., 6).

         j = 1    j = 2    j = 3    j = 4    j = 5    j = 6
i = 0    Y(0,1)   Y(0,2)   Y(0,3)   Y(0,4)   Y(0,5)   Y(0,6)
i = 1    Y(1,1)   Y(1,2)   Y(1,3)   Y(1,4)   Y(1,5)   Y(1,6)
i = 2    Y(2,1)   Y(2,2)   Y(2,3)   Y(2,4)   Y(2,5)   Y(2,6)
i = 3    Y(3,1)   Y(3,2)   Y(3,3)   Y(3,4)   Y(3,5)   Y(3,6)
i = 4    Y(4,1)   Y(4,2)   Y(4,3)   Y(4,4)   Y(4,5)   Y(4,6)
i = 5    Y(5,1)   Y(5,2)   Y(5,3)   Y(5,4)   Y(5,5)   Y(5,6)
i = 6    Y(6,1)   Y(6,2)   Y(6,3)   Y(6,4)   Y(6,5)   Y(6,6)

I. Deterministic Single Cancellation Axiom
[Figure: the matrix of Y(i,j) values (rows: levels of the row variable, i = 0, ..., 6; columns: levels of the column variable, j = 1, ..., 6), with a premise and its implication marked for each column and for each row.]
Like a Guttman scale (1950).

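I. Probabilistic Measurement Theory

Define: $\theta_{ij}$ = probability that a person with score level i answers item j correctly.
Rows: ability level (test score), i = 0, ..., 6. Columns: test items in easiness order, j = 1, ..., 6.

         j = 1   j = 2   j = 3   j = 4   j = 5   j = 6
i = 0    θ_01    θ_02    θ_03    θ_04    θ_05    θ_06
i = 1    θ_11    θ_12    θ_13    θ_14    θ_15    θ_16
i = 2    θ_21    θ_22    θ_23    θ_24    θ_25    θ_26
i = 3    θ_31    θ_32    θ_33    θ_34    θ_35    θ_36
i = 4    θ_41    θ_42    θ_43    θ_44    θ_45    θ_46
i = 5    θ_51    θ_52    θ_53    θ_54    θ_55    θ_56
i = 6    θ_61    θ_62    θ_63    θ_64    θ_65    θ_66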

I. Single Cancellation Axiom (rows)
[Figure: the matrix of $\theta_{ij}$ (rows: ability level / test score, i = 0, ..., 6; columns: test items in easiness order, j = 1, ..., 6), with a premise and its implication marked for each row.]

I. Single Cancellation Axiom (rows)
Key axiom for representing person ability (test score) on an ordinal scale.
All Item Response Theory models of the form $\Pr(Y_j = 1 \mid \theta) = G_j(\theta)$, for non-decreasing $G_j: \mathbb{R} \to [0,1]$, assume this axiom.
Examples of such IRT models:
1PL Rasch model: $\Pr(Y_j = 1 \mid \theta) = \exp(\theta - \beta_j) / [1 + \exp(\theta - \beta_j)]$
2PL: $\Pr(Y_j = 1 \mid \theta) = \exp(a_j\{\theta - \beta_j\}) / [1 + \exp(a_j\{\theta - \beta_j\})]$
3PL: $\Pr(Y_j = 1 \mid \theta) = c_j + (1 - c_j) / [1 + \exp(-a_j\{\theta - \beta_j\})]$
MH Model: $\Pr(Y_j = 1 \mid \theta)$ is non-decreasing in $\theta$.
DM Model: $\Pr(Y_j = 1 \mid \theta)$ is non-decreasing in $\theta$, AND IIO: $\Pr(Y_1 = 1 \mid \theta) < \Pr(Y_2 = 1 \mid \theta) < \cdots < \Pr(Y_J = 1 \mid \theta)$ for all $\theta$.
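As a rough illustration of the parametric models listed on this slide, the sketch below evaluates the Rasch, 2PL, and 3PL response functions on a grid of ability values and checks that each curve is non-decreasing in ability, which is the property the row-wise single cancellation axiom requires. The function names (`rasch`, `two_pl`, `three_pl`, `is_non_decreasing`) and the parameter values are illustrative assumptions, not taken from the talk.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def rasch(theta: float, beta: float) -> float:
    """1PL Rasch model: exp(theta - beta) / [1 + exp(theta - beta)]."""
    return logistic(theta - beta)


def two_pl(theta: float, a: float, beta: float) -> float:
    """2PL model with item discrimination a."""
    return logistic(a * (theta - beta))


def three_pl(theta: float, a: float, beta: float, c: float) -> float:
    """3PL model with lower asymptote (guessing parameter) c."""
    return c + (1.0 - c) * logistic(a * (theta - beta))


def is_non_decreasing(values) -> bool:
    """True if the sequence never decreases from one entry to the next."""
    return all(x <= y for x, y in zip(values, values[1:]))


# Hypothetical ability grid and item parameters, chosen only for illustration.
ability_grid = [-3.0 + 0.5 * k for k in range(13)]
rasch_curve = [rasch(t, beta=0.3) for t in ability_grid]
two_pl_curve = [two_pl(t, a=1.7, beta=0.3) for t in ability_grid]
three_pl_curve = [three_pl(t, a=1.7, beta=0.3, c=0.2) for t in ability_grid]

for name, curve in [("Rasch", rasch_curve), ("2PL", two_pl_curve), ("3PL", three_pl_curve)]:
    print(name, "non-decreasing in ability:", is_non_decreasing(curve))
```

Monotonicity here holds because the discrimination values used are positive; a negative `a` would reverse the curve and break the axiom.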

I. Single Cancellation Axiom
[Figure: the matrix of $\theta_{ij}$ (rows: ability level / test score, i = 0, ..., 6; columns: test items in easiness order, j = 1, ..., 6), with a premise and its implication marked for each row and for each column.]

I. Single Cancellation Axiom
Key axiom for representing person ability (test score) and item easiness (difficulty) on a common ordinal scale.
Examples of IRT models that (fully) assume single cancellation:
1PL Rasch model: $\Pr(Y_j = 1 \mid \theta) = \exp(\theta - \beta_j) / [1 + \exp(\theta - \beta_j)]$
OPLM model: $\Pr(Y_j = 1 \mid \theta) = \exp(\alpha\{\theta - \beta_j\}) / [1 + \exp(\alpha\{\theta - \beta_j\})]$
DM Model: $\Pr(Y_j = 1 \mid \theta)$ is non-decreasing in $\theta$, and IIO: $\Pr(Y_1 = 1 \mid \theta) < \Pr(Y_2 = 1 \mid \theta) < \cdots < \Pr(Y_J = 1 \mid \theta)$ for all $\theta$.

I. Double Cancellation Axiom
[Figure: the matrix of $\theta_{ij}$ (rows: ability level / test score, i = 0, ..., 6; columns: test items in easiness order, j = 1, ..., 6), with the premise and the implication marked on a 3 × 3 submatrix.]
The axiom must hold for all 3 × 3 submatrices.

I. Triple Cancellation Axiom
[Figure: the matrix of $\theta_{ij}$ (rows: ability level / test score, i = 0, ..., 6; columns: test items in easiness order, j = 1, ..., 6), with the premise and the implication marked on a 4 × 4 submatrix.]
The axiom must hold for all 4 × 4 submatrices.

I. Single, Double, Triple, and all higher order cancellation axioms
Key axioms for representing person ability (test score) and item easiness (difficulty) on a common interval scale. All these axioms, together, are the axioms for additive conjoint measurement.
Examples of IRT models that (fully) assume all of these cancellation axioms:
1PL Rasch model (logistic): $\Pr(Y_j = 1 \mid \theta) = \exp(\theta - \beta_j) / [1 + \exp(\theta - \beta_j)]$
Any 1PL model of the form $\Pr(Y_j = 1 \mid \theta) = G(\theta - \beta_j)$, for a non-decreasing $G: \mathbb{R} \to [0,1]$ common to all test items.
All previous discussions about measurement axioms and IRT also apply to polytomous IRT models.

How to Test Measurement Axioms?
Even the probabilistic measurement axioms are deterministic: they assert deterministic order relations among probabilities.
Perline, Wright & Wainer (PWW; 1979, APM), to test the Rasch model, analyzed data from a 10-item dichotomously scored test administered to 2500 released convicts (from Hoffman & Beck, 1974). The test inquires about the subject's criminal history.
PWW tested the conjoint measurement axioms on real data by counting the number of axiom violations: for example, the number of rows violating single cancellation, and the number of 3 × 3 submatrices violating double cancellation.
This axiom testing approach does not distinguish between small and large axiom violations. We illustrate this issue now.
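The counting approach described on this slide can be sketched in a few lines. The code below is an illustration under stated assumptions, not PWW's actual procedure: single cancellation violations are counted as order reversals between adjacent rows or adjacent columns, and double cancellation is checked in one common Luce-Tukey form (if p[r2][c1] >= p[r1][c2] and p[r3][c2] >= p[r2][c3], then p[r3][c1] >= p[r1][c3]) over every choice of three rows and three columns. All function names and matrices are hypothetical.

```python
from itertools import combinations


def count_row_order_violations(p):
    """Count cells where the next-higher score group has a LOWER proportion correct."""
    rows, cols = len(p), len(p[0])
    return sum(p[i + 1][j] < p[i][j] for i in range(rows - 1) for j in range(cols))


def count_column_order_violations(p):
    """Count cells where the next-easier item has a LOWER proportion correct."""
    rows, cols = len(p), len(p[0])
    return sum(p[i][j + 1] < p[i][j] for i in range(rows) for j in range(cols - 1))


def count_double_cancellation_violations(p):
    """Count 3 x 3 submatrices whose premise holds but whose implication fails."""
    rows, cols = len(p), len(p[0])
    count = 0
    for r1, r2, r3 in combinations(range(rows), 3):
        for c1, c2, c3 in combinations(range(cols), 3):
            premise = p[r2][c1] >= p[r1][c2] and p[r3][c2] >= p[r2][c3]
            if premise and not p[r3][c1] >= p[r1][c3]:
                count += 1
    return count


# Hypothetical matrices of proportions correct (rows: score groups, columns: items).
tiny_reversal = [[0.20, 0.40],
                 [0.19, 0.50]]   # reversal of 0.01 in the first column
large_reversal = [[0.80, 0.90],
                  [0.10, 0.95]]  # reversal of 0.70 in the first column
ordered_but_not_additive = [[0.10, 0.20, 0.60],
                            [0.25, 0.50, 0.70],
                            [0.30, 0.80, 0.90]]

print(count_row_order_violations(tiny_reversal), count_row_order_violations(large_reversal))
print(count_double_cancellation_violations(ordered_but_not_additive))
```

The first two matrices produce the same violation count even though one reversal is tiny and the other is large, which is the limitation the slide points to. The third matrix is ordered in both rows and columns yet still fails double cancellation.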

True or Random Violation of the Single Cancellation Axiom?

True or Random Violation of the Single and Double Cancellation Axioms?
Apparent single cancellation axiom violations in red.
Apparent double cancellation axiom violations in purple.

How to Test Measurement Axioms?
The number of axiom violations, as a statistic, has an intractable sampling distribution for the purposes of hypothesis testing.
The false discovery rate approach to multiple testing (Benjamini & Hochberg, 1995, JRSSB) is not easily applicable, because the different axioms, such as single cancellation and double cancellation, are dependent on one another.

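II. Bayesian Model for Axiom Testing

The data:
$n = (n_{ij})_{(I+1) \times J}$, where $n_{ij}$ is the number correct in test score group i for item j;
$N = (N_{ij})_{(I+1) \times J}$, where $N_{ij}$ is the number in test score group i who completed item j.
MLE: $p = (p_{ij})_{(I+1) \times J} = (n_{ij} / N_{ij})_{(I+1) \times J}$.

Data likelihood:
$$L(n \mid N, \theta) = \prod_{i=0}^{I} \prod_{j=1}^{J} \binom{N_{ij}}{n_{ij}} \theta_{ij}^{n_{ij}} (1 - \theta_{ij})^{N_{ij} - n_{ij}}$$

Prior density, i.e., set of axioms:
$$\pi(\theta) = \frac{\prod_{i=0}^{I} \prod_{j=1}^{J} \mathrm{be}(\theta_{ij} \mid a_{ij}, b_{ij}) \, 1(\theta \in A)}{\int \prod_{i=0}^{I} \prod_{j=1}^{J} \mathrm{be}(\theta_{ij} \mid a_{ij}, b_{ij}) \, 1(\theta \in A) \, d\theta}$$

$\mathrm{be}(\cdot \mid a, b)$: beta p.d.f. $\mathrm{Be}(\cdot \mid a, b)$: beta c.d.f. $\mathrm{Be}^{-1}(u \mid a, b)$: quantile.
$1(\theta \in A) = 1$ if $\theta \in A$; $1(\theta \in A) = 0$ if $\theta \notin A$.
Often in practice, a = b = 1 (truncated uniform prior) or a = b = 1/2 (truncated reference prior).

Example: single cancellation axiom (rows & columns),
$$A = \{\theta : \theta_{ij} < \theta_{i+1,j} \text{ for } i = 0, 1, \ldots, I-1 \ \text{ and } \ \theta_{ij} < \theta_{i,j+1} \text{ for } j = 1, \ldots, J-1\}$$
(i: test score level; j indexes item in item easiness order)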

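II. Bayesian Model for Axiom Testing

Posterior density (distribution):
$$\pi(\theta \mid N, n, A) = \frac{L(n \mid N, \theta)\, \pi(\theta)}{\int L(n \mid N, \theta)\, \pi(\theta)\, d\theta} = \frac{\prod_{i=0}^{I} \prod_{j=1}^{J} \binom{N_{ij}}{n_{ij}} \theta_{ij}^{n_{ij}} (1 - \theta_{ij})^{N_{ij} - n_{ij}} \, \mathrm{be}(\theta_{ij} \mid a_{ij}, b_{ij}) \, 1(\theta \in A)}{\int \prod_{i=0}^{I} \prod_{j=1}^{J} \binom{N_{ij}}{n_{ij}} \theta_{ij}^{n_{ij}} (1 - \theta_{ij})^{N_{ij} - n_{ij}} \, \mathrm{be}(\theta_{ij} \mid a_{ij}, b_{ij}) \, 1(\theta \in A) \, d\theta}$$
$$\propto \prod_{i=0}^{I} \prod_{j=1}^{J} \mathrm{be}(\theta_{ij} \mid a_{ij} + n_{ij},\, b_{ij} + N_{ij} - n_{ij}) \, 1(\theta \in A),$$
where the prior is $\pi(\theta) \propto \prod_{i=0}^{I} \prod_{j=1}^{J} \mathrm{be}(\theta_{ij} \mid a_{ij}, b_{ij}) \, 1(\theta \in A)$.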

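II. Bayesian Model for Axiom Testing

Posterior density (distribution), with c.d.f. $\Pi(\theta \mid N, n, A)$:
$$\pi(\theta \mid N, n, A) = \frac{\prod_{i=0}^{I} \prod_{j=1}^{J} \binom{N_{ij}}{n_{ij}} \theta_{ij}^{n_{ij}} (1 - \theta_{ij})^{N_{ij} - n_{ij}} \, \mathrm{be}(\theta_{ij} \mid a_{ij}, b_{ij}) \, 1(\theta \in A)}{\int \prod_{i=0}^{I} \prod_{j=1}^{J} \binom{N_{ij}}{n_{ij}} \theta_{ij}^{n_{ij}} (1 - \theta_{ij})^{N_{ij} - n_{ij}} \, \mathrm{be}(\theta_{ij} \mid a_{ij}, b_{ij}) \, 1(\theta \in A) \, d\theta}$$
The posterior cannot be numerically evaluated.

MCMC full conditional posterior p.d.f.s (f.c.p.s):
$$\pi(\theta_{ij} \mid N, n, \theta_{\setminus ij}) \propto \mathrm{be}(\theta_{ij} \mid a_{ij} + n_{ij},\, b_{ij} + N_{ij} - n_{ij}) \, 1(\theta \in A), \quad \forall\, i, j.$$

Each MCMC sampling iteration: for every pair (i, j) in turn, update/sample $\theta_{ij}$ by drawing $u_{ij} \sim U(0,1)$, and then taking
$$\theta_{ij} = \mathrm{Be}^{-1}\Big( \mathrm{Be}(\theta_{ij}^{\min} \mid a_{ij}^{*}, b_{ij}^{*}) + u_{ij} \big[ \mathrm{Be}(\theta_{ij}^{\max} \mid a_{ij}^{*}, b_{ij}^{*}) - \mathrm{Be}(\theta_{ij}^{\min} \mid a_{ij}^{*}, b_{ij}^{*}) \big] \,\Big|\, a_{ij}^{*}, b_{ij}^{*} \Big),$$
where $a_{ij}^{*} = a_{ij} + n_{ij}$, $b_{ij}^{*} = b_{ij} + N_{ij} - n_{ij}$, and $(\theta_{ij}^{\min}, \theta_{ij}^{\max})$ is the range that A allows for $\theta_{ij}$ given the current values of the other parameters (inverse c.d.f. sampling method; Devroye, 1986).

As the number of MCMC iterations S gets larger, the MCMC chain $\{\theta^{(s)}\}_{s=1,\ldots,S}$ converges to samples from the posterior distribution $\Pi(\theta \mid N, n, A)$.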
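The Gibbs update on this slide can be sketched with the Python standard library alone, under two simplifying assumptions that are not part of the talk: the prior is the truncated uniform (a = b = 1), so every full conditional has integer beta parameters and its c.d.f. reduces to a binomial tail sum; and the quantile function is obtained by bisection within the allowed interval. The axiom set is single cancellation over rows and columns. The names `beta_cdf`, `draw_truncated_beta`, `gibbs_sweep`, and the counts `n`, `N` are hypothetical.

```python
import math
import random


def beta_cdf(x, a, b):
    """Beta(a, b) c.d.f. for positive INTEGER a, b, via the binomial-tail identity."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    m = a + b - 1
    total = 0.0
    for k in range(a, m + 1):
        log_term = (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
                    + k * math.log(x) + (m - k) * math.log1p(-x))
        total += math.exp(log_term)
    return min(total, 1.0)


def draw_truncated_beta(a, b, lo, hi, rng):
    """Inverse-c.d.f. draw from Beta(a, b) restricted to the interval [lo, hi]."""
    f_lo, f_hi = beta_cdf(lo, a, b), beta_cdf(hi, a, b)
    target = f_lo + rng.random() * (f_hi - f_lo)
    left, right = lo, hi
    for _ in range(50):  # bisection stands in for the quantile function
        mid = 0.5 * (left + right)
        if beta_cdf(mid, a, b) < target:
            left = mid
        else:
            right = mid
    return 0.5 * (left + right)


def gibbs_sweep(theta, n, N, rng):
    """Update every theta[i][j] in turn, keeping rows and columns ordered."""
    rows, cols = len(theta), len(theta[0])
    for i in range(rows):
        for j in range(cols):
            lo = max(theta[i - 1][j] if i > 0 else 0.0,
                     theta[i][j - 1] if j > 0 else 0.0)
            hi = min(theta[i + 1][j] if i < rows - 1 else 1.0,
                     theta[i][j + 1] if j < cols - 1 else 1.0)
            theta[i][j] = draw_truncated_beta(1 + n[i][j], 1 + N[i][j] - n[i][j], lo, hi, rng)


# Hypothetical counts: 3 score groups by 3 items, 20 respondents per cell.
n = [[2, 5, 9], [4, 8, 12], [7, 13, 18]]
N = [[20, 20, 20], [20, 20, 20], [20, 20, 20]]
rng = random.Random(1)
# Starting values that already satisfy the order constraints.
theta = [[(i + j + 1) / 6 for j in range(3)] for i in range(3)]
draws = []
for s in range(100):
    gibbs_sweep(theta, n, N, rng)
    draws.append([row[:] for row in theta])
posterior_mean = [[sum(d[i][j] for d in draws) / len(draws) for j in range(3)] for i in range(3)]
print(posterior_mean)
```

For non-integer prior parameters such as a = b = 1/2, a general incomplete beta function (for instance from a numerical library) would be needed in place of `beta_cdf`. A real analysis would also use many more iterations and discard a burn-in period.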

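II. Bayesian Model for Axiom Testing

Possible ways to test axioms from the model:

1. Check whether $p_{ij} = n_{ij} / N_{ij}$ is within the 95% posterior interval of the marginal posterior distribution $\Pi(\theta_{ij} \mid N, n, A)$. Decide violation of axiom(s) if $p_{ij}$ is located outside the 95% posterior interval.

2. Compute the posterior predictive p-value (Karabatsos & Sheu, 2004, APM):
$$\text{pvalue}_{ij} = \iint 1\{\chi^2(p_{ij}^{\text{rep}}; \theta_{ij}) \geq \chi^2(p_{ij}; \theta_{ij})\} \, \pi(p_{ij}^{\text{rep}}, \theta_{ij} \mid N, n, A) \, dp_{ij}^{\text{rep}} \, d\theta_{ij},$$
with
$$\chi^2(p_{ij}; \theta_{ij}) = \frac{(N_{ij} p_{ij} - N_{ij} \theta_{ij})^2}{N_{ij} \theta_{ij}}, \qquad n_{ij}^{\text{rep}} \mid N_{ij}, \theta_{ij} \sim \mathrm{bin}(N_{ij}, \theta_{ij}), \qquad p_{ij}^{\text{rep}} = n_{ij}^{\text{rep}} / N_{ij}.$$
Decide violations of axioms if $\text{pvalue}_{ij} < .05$ (or smaller).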
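A Monte Carlo version of the posterior predictive p-value for a single cell might look like the sketch below. It assumes the chi-square discrepancy (N p - N theta)^2 / (N theta) and takes a list of posterior draws of theta for that cell as given. The two lists of draws are hypothetical stand-ins for MCMC output (they are not produced by the order-constrained model): one concentrated near the observed proportion, one forced far away from it, as could happen when an order constraint conflicts with the data.

```python
import random


def chi_square(p, theta, N):
    """Discrepancy (N*p - N*theta)^2 / (N*theta) for one cell."""
    return (N * p - N * theta) ** 2 / (N * theta)


def posterior_predictive_pvalue(n_obs, N, theta_draws, rng):
    """Share of draws whose replicated discrepancy is at least the observed one."""
    p_obs = n_obs / N
    exceed = 0
    for theta in theta_draws:
        # One replicated count n_rep ~ Binomial(N, theta), as a sum of Bernoulli draws.
        n_rep = sum(rng.random() < theta for _ in range(N))
        if chi_square(n_rep / N, theta, N) >= chi_square(p_obs, theta, N):
            exceed += 1
    return exceed / len(theta_draws)


rng = random.Random(0)
n_obs, N = 18, 20  # hypothetical cell: 18 of 20 correct, so p = 0.90

# Hypothetical stand-ins for posterior draws of theta in this cell.
draws_near = [rng.uniform(0.85, 0.95) for _ in range(2000)]  # agrees with the data
draws_far = [rng.uniform(0.45, 0.55) for _ in range(2000)]   # pushed away from the data

p_near = posterior_predictive_pvalue(n_obs, N, draws_near, rng)
p_far = posterior_predictive_pvalue(n_obs, N, draws_far, rng)
print("p-value, draws near the data:", p_near)
print("p-value, draws far from the data:", p_far)
```

A small p-value in a cell signals that the order-constrained posterior cannot reproduce the observed proportion there, which is how the slide's decision rule flags an axiom violation.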

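II. Bayesian Model for Axiom Testing

Possible ways to test axioms from the model (continued):

3. Consider the Deviance Information Criterion (DIC):
$$\mathrm{DIC} = D(\bar{\theta}) + 2[\bar{D} - D(\bar{\theta})]$$
Deviance:
$$D(\theta) = -2 \sum_{i=0}^{I} \sum_{j=1}^{J} \left\{ n_{ij} \log \theta_{ij} + (N_{ij} - n_{ij}) \log(1 - \theta_{ij}) + \log \binom{N_{ij}}{n_{ij}} \right\}$$
Deviance at posterior mean: $D(\bar{\theta})$, with $\bar{\theta} = E(\theta \mid N, n, A)$.
Posterior mean of deviance: $\bar{D} = \int D(\theta) \, d\Pi(\theta \mid N, n, A)$.

$D(\bar{\theta})$ is the goodness (badness) of fit term.
$2[\bar{D} - D(\bar{\theta})]$ is the model flexibility penalty, given by 2 times the effective number of model parameters.

Consider DIC(A) of the model under axiom (order) constraints, and DIC(U) for the unconstrained model (no order constraints). Decide violations of axiom(s) if DIC(A) > DIC(U).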
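The DIC comparison can be sketched as follows, assuming a uniform (a = b = 1) prior. To keep the example short, the constrained draws are obtained by simple rejection sampling (draw from the unconstrained product of beta posteriors and keep only draws that satisfy the order constraints) rather than by the Gibbs sampler of the talk; this targets the same truncated posterior but is practical only for small tables. The data and all function names are hypothetical.

```python
import math
import random


def deviance(theta, n, N):
    """-2 times the binomial log-likelihood summed over all cells."""
    total = 0.0
    for th_row, n_row, N_row in zip(theta, n, N):
        for th, k, m in zip(th_row, n_row, N_row):
            log_coef = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
            total += k * math.log(th) + (m - k) * math.log(1.0 - th) + log_coef
    return -2.0 * total


def dic(draws, n, N):
    """DIC = D(theta_bar) + 2 * [mean deviance - D(theta_bar)]."""
    S, rows, cols = len(draws), len(n), len(n[0])
    theta_bar = [[sum(d[i][j] for d in draws) / S for j in range(cols)] for i in range(rows)]
    d_at_mean = deviance(theta_bar, n, N)
    d_mean = sum(deviance(d, n, N) for d in draws) / S
    return d_at_mean + 2.0 * (d_mean - d_at_mean)


def sample_unconstrained(n, N, rng):
    """One draw from independent Beta(1 + n, 1 + N - n) posteriors."""
    return [[rng.betavariate(1 + k, 1 + m - k) for k, m in zip(n_row, N_row)]
            for n_row, N_row in zip(n, N)]


def satisfies_single_cancellation(theta):
    """True if theta increases strictly down every column and across every row."""
    rows, cols = len(theta), len(theta[0])
    down = all(theta[i][j] < theta[i + 1][j] for i in range(rows - 1) for j in range(cols))
    across = all(theta[i][j] < theta[i][j + 1] for i in range(rows) for j in range(cols - 1))
    return down and across


# Hypothetical 2 x 2 table in which the first row reverses the item ordering.
n = [[12, 9], [10, 15]]
N = [[20, 20], [20, 20]]
rng = random.Random(7)

draws_U = [sample_unconstrained(n, N, rng) for _ in range(2000)]
draws_A = []
while len(draws_A) < 500:  # rejection step: keep only order-satisfying draws
    candidate = sample_unconstrained(n, N, rng)
    if satisfies_single_cancellation(candidate):
        draws_A.append(candidate)

dic_U, dic_A = dic(draws_U, n, N), dic(draws_A, n, N)
print("DIC(U):", dic_U, "DIC(A):", dic_A)
```

Under the slide's rule, the axioms would be judged violated for this table if the printed DIC(A) exceeds DIC(U).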

Apparent single cancellation axiom violations in red.

Test of single cancellation (over rows only): no significant violation of single cancellation over rows. Results from Karabatsos (2001, JAM).

Test of single cancellation (over rows and columns): significant violation of the single cancellation axiom. Results from Karabatsos (2001, JAM).

True or Random Violation of the Single and Double Cancellation Axioms?
Apparent single cancellation axiom violations in red.
Apparent double cancellation axiom violation in purple.

Test of single and double cancellation: significant violation of the single and double cancellation axioms (Karabatsos, 2001, JAM).

NAEP reading test data: 100 examinees, 6 items. Posterior predictive chi-square test of single cancellation (over rows). Violations indicated by bold. Results from Karabatsos & Sheu (2004, APM).

NAEP reading test data: 100 examinees, 6 items. Posterior predictive chi-square test of single cancellation (over columns). Violations indicated by bold. Results from Karabatsos & Sheu (2004, APM).

IV. Dealing With Axiom Violations
We have seen from the previous two empirical applications that the measurement axioms can be violated, even by data arising from carefully constructed tests.
One way to deal with the problem is to define a more flexible IRT model that can handle outliers: a flexible Bayesian nonparametric (BNP) outlier-robust IRT model.
We present and briefly illustrate the model through the analysis of data arising from a teacher preparation survey from PIRLS.
244 respondents (teachers). Each rated (0-2) their own level of teacher preparation on 10 items: CERTIFICATE, LANGUAGE, LITERATURE, PEDAGOGY, PSYCHOLOGY, REMEDIAL, THEORY, LANGDEV, SPED, SECLANG.
Also included covariates AGE, FEMALE, Miss:FEMALE.

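BNP-IRT model (Karabatsos, 2015, Handbook of Modern IRT)

$$f(D \mid X; \zeta) = \prod_{p=1}^{P} \prod_{j=1}^{J} f(y_{pj} \mid x_{pj}; \zeta)$$
$$f(y_{pj} \mid x_{pj}; \zeta) = P(Y_{pj} = 1 \mid x_{pj}; \zeta)^{y_{pj}} \, [1 - P(Y_{pj} = 1 \mid x_{pj}; \zeta)]^{1 - y_{pj}}$$
$$\Pr(Y = 1 \mid x; \zeta) = 1 - F^{*}(0 \mid x; \zeta) = \int_{0}^{\infty} f^{*}(y^{*} \mid x; \zeta) \, dy^{*} = \int_{0}^{\infty} \sum_{k=-\infty}^{\infty} \mathrm{n}(y^{*} \mid \mu_k + x^{\top}\beta, \sigma^2) \, \omega_k(x; \beta_\omega, \sigma_\omega) \, dy^{*}$$
$$\omega_k(x; \beta_\omega, \sigma_\omega) = \Phi\!\left(\frac{k - x^{\top}\beta_\omega}{\sigma_\omega}\right) - \Phi\!\left(\frac{k - 1 - x^{\top}\beta_\omega}{\sigma_\omega}\right)$$
$$\mu_k \mid \sigma_\mu^2 \sim N(0, \sigma_\mu^2), \qquad \sigma_\mu \sim U(0, b_{\sigma\mu})$$
$$\beta \mid \sigma^2 \sim N(0, \sigma^2 v \, \mathrm{diag}(\infty, J_{NJ})), \qquad \beta_\omega \mid \sigma_\omega^2 \sim N(0, \sigma_\omega^2 v_\omega I_{NJ+1})$$
$$\sigma^2, \sigma_\omega^2 \sim \mathrm{IG}(\sigma^2 \mid a_0/2, a_0/2) \, \mathrm{IG}(\sigma_\omega^2 \mid a_\omega/2, a_\omega/2)$$

Persons (examinees) indexed by p = 1, ..., P.
Test items indexed by j = 1, ..., J.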

Absolutely no item response outliers under the BNP-IRT model.

[Figure: for the BNP-IRT model, boxplots of the marginal posterior distributions of the item, covariate, and prior parameters. Vertical axis: Value. Dependent variable = itemrespvs0. Parameters shown: beta0; beta for each of the 10 items and the covariates age, female, miss:female at rating levels (1) and (2); sigma^2; sigma^2_mu; beta_w0; beta_w for the same items and covariates at levels (1) and (2); sigma^2_w.]

The estimated posterior means of the person ability parameters were found to be distributed with mean .00, s.d. .46, minimum -.66, and maximum 3.68 for the 244 persons.

V. Conclusions
The ability to measure persons and/or items on an ordinal or interval scale depends on the data satisfying a hierarchy of conjoint measurement axioms, including single, double, triple, and higher order cancellation conditions.
We presented a Bayesian model that can represent a set of one or more axioms in terms of order constraints on binomial parameters, with the constraints enforced by the prior distribution.
This model provides a coherent approach to testing the measurement axioms on real data sets.

V. Conclusions
Applications of the Bayesian axiom testing model showed that the measurement axioms can be violated even by data arising from carefully constructed tests.
As a possible remedy to this issue, we propose a more flexible BNP-IRT model that can provide estimates of person and item parameters that are robust to any item response outliers in the data.
In a sense, the BNP-IRT model is not wrong for the data; it is a highly flexible model, which makes the practice of model-checking, axiom testing, or model fit analysis rather irrelevant. (For related arguments, see Karabatsos & Walker, 2009, BJMSP.)

V. Conclusions
The Bayesian axiom testing model of Karabatsos (2001) was later used to
-- test decision theory axioms (e.g., Myung et al., 2005, JMP);
-- test measurement axioms (e.g., Kyngdon, 2011; Domingue, 2012). The latter author suggested a minor modification to the MH algorithm of Karabatsos (2001) to handle more orderings under double cancellation.
Like Karabatsos & Sheu (2004), this talk focused on a Gibbs sampler, which in MCMC practice is usually preferable to a rejection sampler like the MH algorithm.
Karabatsos (2005, JMP) defined a binomial parameter $\theta$ as the probability of a choice that satisfies an axiom. Then, under a conjugate beta prior for $\theta$, we may directly calculate a Bayes factor to test the axiom ($H_0$) according to $H_0: \theta > c$ versus $H_1: \theta < c$, for some large c, such as .95.
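One way to compute such a Bayes factor is sketched below. It relies on a standard identity for hypotheses that split a common prior at c: the Bayes factor equals the posterior odds of theta > c divided by the prior odds. This may differ in detail from the calculation in Karabatsos (2005). The sketch assumes integer beta parameters (for example the uniform prior a = b = 1) so that the beta c.d.f. can be computed as a binomial tail sum with the standard library; `beta_cdf`, `bayes_factor_h0`, and the counts are hypothetical.

```python
import math


def beta_cdf(x, a, b):
    """Beta(a, b) c.d.f. for positive INTEGER a, b, via the binomial-tail identity."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    m = a + b - 1
    total = 0.0
    for k in range(a, m + 1):
        log_term = (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
                    + k * math.log(x) + (m - k) * math.log1p(-x))
        total += math.exp(log_term)
    return min(total, 1.0)


def bayes_factor_h0(c, x, m, a=1, b=1):
    """Bayes factor for H0: theta > c versus H1: theta < c.

    x of m observed choices satisfy the axiom; the prior is Beta(a, b), so the
    posterior is Beta(a + x, b + m - x). Returns posterior odds / prior odds.
    """
    prior_below = beta_cdf(c, a, b)
    post_below = beta_cdf(c, a + x, b + m - x)
    prior_odds = (1.0 - prior_below) / prior_below
    post_odds = (1.0 - post_below) / post_below
    return post_odds / prior_odds


# Hypothetical data: number of choices (out of 100) that satisfy the axiom.
bf_all_satisfy = bayes_factor_h0(c=0.95, x=100, m=100)
bf_many_violate = bayes_factor_h0(c=0.95, x=80, m=100)
print("BF with 100/100 satisfying:", bf_all_satisfy)
print("BF with 80/100 satisfying:", bf_many_violate)
```

Values above 1 favor the axiom (H0) and values below 1 favor H1. The function divides by the posterior probability of theta < c, so it would need a guard if that probability underflowed to zero.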

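Extensions of Axiom Testing Model (1)
Allow for random orderings for the cancellation axioms. Write $\eta = (\eta_1, \ldots, \eta_N)$ for the person parameters and $\beta = (\beta_1, \ldots, \beta_J)$ for the item parameters of a Rasch model, to keep them distinct from the binomial parameters $\theta = (\theta_{ij})_{(I+1) \times J}$.

Consider the joint posterior distribution:
$$\pi(\eta, \beta, \theta, A_{\eta,\beta} \mid Y, N, n) = \pi(\eta, \beta, A_{\eta,\beta} \mid Y) \, \pi(\theta \mid N, n, A_{\eta,\beta}),$$
given the Rasch model for $Y = (y_{pj})_{N \times J}$:
$$P(Y_{pj} = 1 \mid \eta_p, \beta_j) = \frac{\exp(\eta_p - \beta_j)}{1 + \exp(\eta_p - \beta_j)}.$$
Posterior distribution:
$$\pi(\eta, \beta \mid Y) = \frac{\prod_{p=1}^{N} \prod_{j=1}^{J} \dfrac{\exp(\{\eta_p - \beta_j\} y_{pj})}{1 + \exp(\eta_p - \beta_j)} \; \mathrm{n}(\eta, \beta \mid 0, I_{N+J})}{\int \prod_{p=1}^{N} \prod_{j=1}^{J} \dfrac{\exp(\{\eta_p - \beta_j\} y_{pj})}{1 + \exp(\eta_p - \beta_j)} \; d\mathrm{n}(\eta, \beta \mid 0, I_{N+J})}$$
As before,
$$\pi(\theta \mid N, n, A_{\eta,\beta}) \propto \prod_{i=0}^{I} \prod_{j=1}^{J} \mathrm{be}(\theta_{ij} \mid a_{ij}^{*}, b_{ij}^{*}) \, 1(\theta \in A_{\eta,\beta}), \qquad a_{ij}^{*} = a_{ij} + n_{ij}, \quad b_{ij}^{*} = b_{ij} + N_{ij} - n_{ij}.$$
$A_{\eta,\beta}$ is the random linear rank ordering that the matrix $(P(Y_{pj} = 1 \mid \eta_p, \beta_j))_{N \times J}$ induces on $\theta = (\theta_{ij})_{(I+1) \times J}$. This ordering automatically satisfies all cancellation axioms.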

Extensions of Axiom Testing Model (1)
The joint posterior distribution of the Rasch person parameters $\eta$, the item parameters $\beta$, the binomial parameters $\theta$, and the ordering $A_{\eta,\beta}$, namely $\pi(\eta, \beta, \theta, A_{\eta,\beta} \mid Y, N, n)$, can then be estimated by the usual MCMC methods.
At each stage of the MCMC chain $\{(\eta^{(s)}, \beta^{(s)}, \theta^{(s)}, A_{\eta,\beta}^{(s)})\}_{s=1:S}$, the Gibbs sampler (inverse c.d.f.) method would be used to provide a Gibbs sampling update for $\theta^{(s)}$, based on the updated ordering $A_{\eta,\beta}^{(s)}$.
The Bayesian axiom tests then proceed as before, but are now marginalized over the posterior distribution of $A_{\eta,\beta}$.

Extensions of Axiom Testing Model (2)
Extend the independent (truncated) beta priors for the $\theta_{ij}$'s, namely
$$\theta \sim \prod_{i} \prod_{j} \mathrm{Be}(\theta_{ij} \mid a, b) \, 1(\theta \in A),$$
to a prior defined by a discrete mixture of beta distributions:
$$\theta = (\theta_{ij})_{(I+1) \times J} \overset{\text{iid}}{\sim} \prod_{i} \prod_{j} \int \mathrm{Be}(\theta_{ij} \mid a, b) \, dG(a, b) \, 1(\theta \in A), \qquad G \sim \mathrm{DP}(\alpha, G_0),$$
where
$$E[G(a, b)] = G_0(a, b) := N_2(\log(a), \log(b) \mid 0, V), \qquad \mathrm{Var}[G(a, b)] = \frac{G_0(a, b)\,[1 - G_0(a, b)]}{\alpha + 1}.$$
Any smooth distribution defined on (0,1) can be approximated arbitrarily well by a suitable mixture of beta distributions.
Such a prior would define a more flexible Bayesian axiom testing model, based on a richer class of prior distributions.

Other Work / Collaborations
Bayesian nonparametric inference of distribution functions under stochastic ordering, $F_1 < F_2 < \cdots < F_K$ (Karabatsos & Walker, 2007, SPL).
o Considered Bernstein polynomial priors and Polya tree priors for the Fs. In each case, posterior inference is based on order-constrained beta posterior distributions (as in Karabatsos, 2001).
Bayesian nonparametric score equating model using a novel dependent Bernstein-Dirichlet polynomial prior for the test score distribution functions $(F_X, F_Y)$ used for equipercentile equating (Karabatsos & Walker, 2009, Psychometrika).
Bayesian inference for test theory without an answer key (Karabatsos & Batchelder, 2003, Psychometrika).
Comparison of 36 person fit statistics (Karabatsos, 2003, AME).

Other Work / Collaborations
Karabatsos, G., and Walker, S.G. (2012). A Bayesian nonparametric causal model. J Statistical Planning & Inference.
o DP mixture of propensity score models for causal inference in nonrandomized studies.
Karabatsos, G., and Walker, S.G. (2012). Bayesian nonparametric mixed random utility models. Computational Statistics & Data Analysis, 56, 1714-1722.
o In terms of an IRT model, provides a DP infinite-mixture of nominal response models, with person and item parameters subject to the infinite-mixture.
Fujimoto, K., and Karabatsos, G. (2014). Dependent Dirichlet Process Rating Model (DDP-RM). Applied Psychological Measurement, 38, 217-228.
o Model allows for clustering of ordinal category thresholds.
o Ken Fujimoto: former Ph.D. student. Now faculty at Loyola U. Chicago.

Other Work / Collaborations
Karabatsos, G., and Walker, S.G. (2012). Adaptive-Modal Bayesian Nonparametric Regression (EJS).
o IRT version of this model, mentioned in this talk, to appear in Handbook Of Item Response Theory (2015).
o Model extended to meta-analysis: Karabatsos, G., Walker, S.G., and Talbott, E. (2014). A Bayesian nonparametric regression model for meta-analysis. Research Synthesis Methods.
o Model extended for causal inference in non-randomized, regression discontinuity designs (Karabatsos & Walker, 2015; to appear in Müller and R. Mitra (Eds.), Nonparametric Bayesian Methods in Biostatistics and Bioinformatics).