LSAC RESEARCH REPORT SERIES. Law School Admission Council Research Report March 2008


LSAC RESEARCH REPORT SERIES

Structural Modeling Using Two-Step MML Procedures

Cees A. W. Glas
University of Twente, Enschede, The Netherlands

Law School Admission Council Research Report
March 2008

A publication of the Law School Admission Council

The Law School Admission Council (LSAC) is a nonprofit corporation whose members are more than 200 law schools in the United States, Canada, and Australia. Headquartered in Newtown, PA, USA, the Council was founded in 1947 to facilitate the law school admission process. The Council has grown to provide numerous products and services to law schools and to more than 85,000 law school applicants each year.

All law schools approved by the American Bar Association (ABA) are LSAC members. Canadian law schools recognized by a provincial or territorial law society or government agency are also members. Accredited law schools outside of the United States and Canada are eligible for membership at the discretion of the LSAC Board of Trustees.

by Law School Admission Council, Inc. All rights reserved.

No part of this work, including information, data, or other portions of the work published in electronic form, may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or by any information storage and retrieval system, without permission of the publisher. For information, write: Communications, Law School Admission Council, 662 Penn Street, Box 40, Newtown, PA.

This study is published and distributed by LSAC. The opinions and conclusions contained in this report are those of the author(s) and do not necessarily reflect the position or policy of LSAC.

Table of Contents

Executive Summary
Abstract
Introduction
The Model
MML1
    The E-Step
    The M-Step
MML2
    The E-Step
    The M-Step
An Example
References

Executive Summary

In a computerized adaptive test (CAT), test questions (items) are selected for administration to a test taker based on the test taker's performance on previous items, with the intent of tailoring the difficulty level of the test to the ability level of the test taker. Data from CATs are special in the sense that every student responds to a virtually unique set of items. Therefore, number-correct scores of different test takers can no longer be compared, and statistical methods traditionally used in the analysis of number-correct scores lose their relevance. The problem can be solved by applying these traditional methods in the analysis of proficiency scores typically produced by the application of an item response theory (IRT) model. IRT is a mathematical model used in the analysis of test data that results in comparable proficiency scores for each test taker even though the test takers may have responded to different items in the test administration.

In this paper, we consider two methods for estimating test-taker proficiencies with traditional statistical methodology. We then compare these proficiency estimates with those obtained using an IRT model. We base our analyses on a dataset from school effectiveness research. The proficiency estimates were close to those obtained with the IRT model. The computing time needed for the traditional statistical methods was about 25% of the time needed for the IRT method.

Abstract

Two methods are considered for estimating the parameters of an item response theory (IRT) model with a multilevel regression model on the person parameters (the MLIRT model). In both methods it is assumed that the item parameters are known. The person parameter estimates are compared with the parameter estimates obtained by a fully Bayesian method where this assumption is not made. The first method is marginal maximum likelihood (MML) given the original item response patterns on the tests (labeled MML1). The second is MML given the estimates of the true scores on the tests (labeled MML2). In both cases, the parameters are estimated using the expectation-maximization (EM) algorithm. The methods are compared using a dataset from school effectiveness research. The parameter estimates were close to the estimates obtained using the fully Bayesian method. The MML estimation procedures took about 25% of the time needed for the fully Bayesian method.

Introduction

In educational and social research, there is a growing interest in the problems associated with describing the relationships between variables of different aggregation levels. For instance, one may be interested in the effects of the school budget on the educational achievement of the students. However, the former variable is defined on the level of schools while the latter variable is defined on the level of students. These problems can be addressed through the use of multilevel models (Bryk & Raudenbush, 1992; Goldstein, 1986; Longford, 1993). In the above example, students are nested in schools; in a multilevel model, the students would make up the first level and the schools the second level.

Although most applications of the multilevel paradigm are found in regression and analysis of variance models (see, for instance, Bryk & Raudenbush, 1992), multilevel modeling does, in principle, apply to all statistical modeling of data where elementary units are nested within higher order units. Longford (1993), for instance, gives examples of multilevel factor analytical models and generalized linear models. Also, in the field of item response theory (IRT), some applications of the multilevel paradigm can be found (Adams, Wilson, & Wu, 1997; Mislevy & Bock, 1989).

The present paper investigates the following problem related to multilevel modeling. In educational research, many variables are measured subject to error. This predominantly concerns the dependent variables, but covariates on the student and school level can be subject to measurement error too. In practice, the multilevel models used belong to the framework of the usual linear multivariate normal model, and solutions to the problem of measurement error boil down to applications of classical test theory (Longford, 1993). One of the many drawbacks of classical test theory is that measurement error is assumed to be independent of the score level of the person. In modern test theory (i.e., IRT), measurement error is defined conditionally on the level of the latent ability variable. Therefore, it seems worthwhile to tackle the problem of measurement error in the framework of hierarchical IRT models.

We consider two methods for estimating the parameters of an IRT model with a multilevel regression model on the person parameters. In both methods it is assumed that the item parameters are known. The parameter estimates are compared with the parameter estimates obtained by a method where this assumption is not made (i.e., the method by Fox and Glas, 2001). The first method is marginal maximum likelihood (MML) given the original item response patterns on the tests (labeled MML1). The second is MML given the estimates of the true scores on the tests (labeled MML2). In both cases the parameters are estimated using the expectation-maximization (EM) algorithm. The methods are compared using a dataset from school effectiveness research.
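The point that IRT defines measurement error conditionally on the latent ability can be illustrated numerically. The following is a minimal sketch, not taken from the report: it assumes a 2PL model with logit $a\theta - b$ and uses the standard result that the asymptotic standard error of the ability estimate is the inverse square root of the test information. The function name and the item parameter values are hypothetical.

```python
import numpy as np

def standard_error_2pl(theta, a, b):
    """Asymptotic standard error of the ML ability estimate under a 2PL model.

    Uses the logit a * theta - b for each item (an assumed parameterization).
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Probability of a correct response for every (theta, item) pair.
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] * a - b)))
    # Test information: sum over items of a^2 * p * (1 - p).
    info = np.sum(a ** 2 * p * (1.0 - p), axis=1)
    return 1.0 / np.sqrt(info)

# Hypothetical 20-item test with item locations spread around zero.
a = np.ones(20)
b = np.linspace(-1.0, 1.0, 20)
se = standard_error_2pl([-3.0, 0.0, 3.0], a, b)
print(se)  # the standard error is smallest near theta = 0 and larger at the extremes
```

Under classical test theory the standard error of measurement would be a single number for all test takers; here it depends on where the test taker sits relative to the items.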

The Model

Assume Level 1 units, say, students $i = 1, \ldots, N_j$, are nested within Level 2 units $j = 1, \ldots, J$ and respond to a number of tests labeled $q = 1, \ldots, Q$. We impose a two-level regression model on the latent variables:

$$\theta_{ijq} = \sum_{p=1}^{P} \beta_{jpq} x_{ijp} + \varepsilon_{ijq} \tag{1}$$

and

$$\beta_{jpq} = \sum_{s=1}^{S} \gamma_{pqs} w_{jspq} + u_{jpq}. \tag{2}$$

$x_{ij1}$ and $w_{j1pq}$ are equal to one. The other values of $x_{ijp}$ and $w_{jspq}$ are values of observed covariates at Level 1 and Level 2, respectively. In principle, latent predictors can also be present. In that case, it is assumed that the model is rewritten to the so-called reduced form, where all latent variables are on the left-hand side of the regression equations. The error terms have distributions

$$\boldsymbol{\varepsilon}_{ij} \sim N(\mathbf{0}, \boldsymbol{\Sigma}), \tag{3}$$

where $\boldsymbol{\Sigma}$ is a $Q \times Q$ covariance matrix, and

$$\mathbf{u}_{j} \sim N(\mathbf{0}, \mathbf{T}),$$

where $\mathbf{T}$ is a $PQ \times PQ$ covariance matrix. $\boldsymbol{\Sigma}$ and $\mathbf{T}$ are not restricted to be diagonal.

It is assumed that $\theta_{ijq}$ is a parameter in a unidimensional response model, either for discrete responses or continuous responses. The latter might, for instance, be the logarithms of response times (van der Linden, 2006, 2007, 2008).

Discrete responses will be responses on a test with polytomously scored items labeled $k = 1, \ldots, K_q$. Item responses will be stochastic variables $U_{ijqk}$ with realizations $u_{ijqk}$. The realizations are integers between 0 and $m_{qk}$. For educational tests, it is reasonable to assume that the probability of obtaining a score of 0 decreases as a function of $\theta$, the probability of obtaining the maximum score increases as a function of $\theta$, and the probabilities of the intermediate scores are single peaked. Such response probabilities are produced by the generalized partial credit model (GPCM; Muraki, 1992), where the probability of a response is given by

$$P(U_{ijqk} = l \mid \theta_{ijq}) = \frac{\exp(l a_{qk} \theta_{ijq} - b_{qkl})}{1 + \sum_{h=1}^{m_{qk}} \exp(h a_{qk} \theta_{ijq} - b_{qkh})}. \tag{4}$$

The partial credit model (PCM; Masters, 1982) is a special case where $a_{qk} = 1$ for all items $k$. The two-parameter logistic model (2PL model; Birnbaum, 1968) is a special case that pertains to dichotomously scored responses (so $m_{qk} = 1$), and the one-parameter logistic model (1PL model or Rasch model; Rasch, 1960) is a further specialization where $a_{qk} = 1$ for all items $qk$.
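The GPCM response probabilities in (4) can be sketched in a few lines of code. This is an illustrative sketch only: the function name `gpcm_probs` and the parameter values are assumptions, and the score 0 category is given exponent zero so that its numerator equals the 1 in the denominator of (4).

```python
import numpy as np

def gpcm_probs(theta, a, b):
    """Category probabilities for a single GPCM item.

    theta : latent ability
    a     : discrimination parameter of the item
    b     : sequence (b_1, ..., b_m) of category parameters
    Returns an array with P(U = 0), ..., P(U = m).
    """
    b = np.asarray(b, dtype=float)
    scores = np.arange(1, b.size + 1)
    # Exponents for scores 1..m; score 0 has exponent 0 (numerator 1).
    z = np.concatenate(([0.0], scores * a * theta - b))
    z -= z.max()          # guard against overflow; ratios are unchanged
    w = np.exp(z)
    return w / w.sum()

# With a single step (m = 1) the GPCM reduces to the 2PL model.
p = gpcm_probs(theta=0.5, a=1.2, b=[0.3])
print(p)                  # P(U = 0) and P(U = 1)
```

Setting `a = 1.0` for every item gives the PCM, and combining `a = 1.0` with a single category parameter gives the Rasch model, mirroring the special cases listed above.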

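Continuous responses can be modeled by a normal response model where it is assumed that the response of student $ij$ on item $qk$ is normally distributed, that is,

$$P(U_{ijqk} = u_{ijqk} \mid \theta_{ijq}, a_{qk}, b_{qk}) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left( -\frac{(u_{ijqk} - \tau_{ijqk})^2}{2\sigma_k^2} \right),$$

with $\tau_{ijqk} = a_{qk}\theta_{ijq} - b_{qk}$.

MML1

We use the EM algorithm (Dempster, Laird, & Rubin, 1977) to estimate $\boldsymbol{\gamma}$, $\boldsymbol{\Sigma}$, and $\mathbf{T}$ given responses $\mathbf{U}$, Level 1 covariates $\mathbf{X}$, Level 2 covariates $\mathbf{W}$, and item parameters $\mathbf{a}$ and $\mathbf{b}$. To apply the logic of the EM algorithm, the values of $\theta_{ijq}$ and $\boldsymbol{\beta}_j$ are seen as missing data (Bock & Aitkin, 1981). The complete data specification for a person $ij$ is then given by

$$p(\mathbf{u}_{ij}, \boldsymbol{\theta}_{ij}, \boldsymbol{\beta}_j \mid \mathbf{a}_{ij}, \mathbf{b}_{ij}, \mathbf{x}_{ij}, \mathbf{W}_j, \boldsymbol{\gamma}, \boldsymbol{\Sigma}, \mathbf{T}) \propto p(\mathbf{u}_{ij} \mid \boldsymbol{\theta}_{ij}, \mathbf{a}_{ij}, \mathbf{b}_{ij})\, p(\boldsymbol{\theta}_{ij} \mid \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})\, p(\boldsymbol{\beta}_j \mid \mathbf{W}_j, \boldsymbol{\gamma}, \mathbf{T}),$$

where $\boldsymbol{\gamma}$ is a $PQS$-dimensional vector of the elements $\gamma_{pqs}$, and $\boldsymbol{\beta}_j$ is a $PQ$-dimensional vector of the elements $\beta_{jpq}$. The E-step consists of estimating the posterior distribution $p(\boldsymbol{\theta}_{ij} \mid \mathbf{u}_{ij}, \mathbf{a}_{ij}, \mathbf{b}_{ij}, \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij}, \mathbf{W}_j, \boldsymbol{\gamma}, \mathbf{T})$, and the M-step consists of a maximization of $E(\log L_{\theta}(\boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\Sigma}, \mathbf{T}) \mid \mathbf{U}, \mathbf{a}, \mathbf{b}, \mathbf{X}, \mathbf{W}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\Sigma}, \mathbf{T})$ with respect to $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$, $\boldsymbol{\Sigma}$, and $\mathbf{T}$.

The E-Step

The E-step consists of computation of the posterior distribution of the missing data. For the present application, this is

$$
\begin{aligned}
p(\boldsymbol{\theta}_{ij} \mid \mathbf{u}_{ij}, \mathbf{a}_{ij}, \mathbf{b}_{ij}, \boldsymbol{\beta}_j, \boldsymbol{\gamma}, \boldsymbol{\Sigma}, \mathbf{T})
&= \frac{p(\mathbf{u}_{ij} \mid \boldsymbol{\theta}_{ij}, \mathbf{a}_{ij}, \mathbf{b}_{ij})\, p(\boldsymbol{\theta}_{ij} \mid \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})\, p(\boldsymbol{\beta}_j \mid \mathbf{W}_j, \boldsymbol{\gamma}, \mathbf{T})}{\left[ \int \cdots \int p(\mathbf{u}_{ij} \mid \boldsymbol{\theta}_{ij}, \mathbf{a}_{ij}, \mathbf{b}_{ij})\, p(\boldsymbol{\theta}_{ij} \mid \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})\, d\boldsymbol{\theta}_{ij} \right] p(\boldsymbol{\beta}_j \mid \mathbf{W}_j, \boldsymbol{\gamma}, \mathbf{T})} \\
&= \frac{p(\mathbf{u}_{ij} \mid \boldsymbol{\theta}_{ij}, \mathbf{a}_{ij}, \mathbf{b}_{ij})\, p(\boldsymbol{\theta}_{ij} \mid \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})}{\int \cdots \int p(\mathbf{u}_{ij} \mid \boldsymbol{\theta}_{ij}, \mathbf{a}_{ij}, \mathbf{b}_{ij})\, p(\boldsymbol{\theta}_{ij} \mid \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})\, d\boldsymbol{\theta}_{ij}}.
\end{aligned}
$$

The M-Step

In the M-step, we maximize with respect to $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$, $\boldsymbol{\Sigma}$, and $\mathbf{T}$. The marginal likelihood equations needed in the M-step can be easily derived using Fisher's identity (Efron, 1977; Louis, 1982; see Glas, 1992, 1998 for applications to IRT). Fisher's identity sets the first-order derivatives with respect to the model parameters equal to the expectation, with respect to the posterior distribution obtained in the E-step, of the first-order derivatives of the complete data likelihood. However, the model for the complete data is just an ordinary multilevel regression model, so the first-order derivatives of the complete data likelihood are easily obtained. Since the parameters $\boldsymbol{\gamma}$ are regression coefficients in a linear regression model, we can apply the generalized least-squares (GLS) estimator, which is equivalent to the maximum likelihood estimator.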
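As an illustration of the GLS step, the sketch below treats the Level 2 coefficients as if they were observed (the complete-data case) in a generic model $\boldsymbol{\beta}_j = \mathbf{W}_j \boldsymbol{\gamma} + \mathbf{u}_j$ with $\mathbf{u}_j \sim N(\mathbf{0}, \mathbf{T})$. The function names, dimensions, and simulated values are assumptions made for the example; in the EM algorithm the coefficients would be replaced by quantities computed in the E-step.

```python
import numpy as np

def gls_gamma(W_list, beta_list, T):
    """GLS estimate of gamma in beta_j = W_j @ gamma + u_j, u_j ~ N(0, T)."""
    T_inv = np.linalg.inv(T)
    # Accumulate sum_j W_j' T^-1 W_j and sum_j W_j' T^-1 beta_j.
    A = sum(W.T @ T_inv @ W for W in W_list)
    c = sum(W.T @ T_inv @ b for W, b in zip(W_list, beta_list))
    return np.linalg.solve(A, c)

def residual_cov(W_list, beta_list, gamma):
    """Average outer product of the Level 2 residuals beta_j - W_j gamma."""
    R = np.array([b - W @ gamma for W, b in zip(W_list, beta_list)])
    return R.T @ R / len(W_list)

# Simulated complete data with hypothetical dimensions and parameter values.
rng = np.random.default_rng(0)
J, n_beta, n_gamma = 200, 2, 3
gamma_true = np.array([1.0, -0.5, 0.25])
T_true = np.array([[0.5, 0.1], [0.1, 0.3]])
W_list = [rng.normal(size=(n_beta, n_gamma)) for _ in range(J)]
beta_list = [W @ gamma_true + rng.multivariate_normal(np.zeros(n_beta), T_true)
             for W in W_list]

gamma_hat = gls_gamma(W_list, beta_list, T_true)
T_hat = residual_cov(W_list, beta_list, gamma_hat)
print(gamma_hat)
print(T_hat)
```

The linear system is solved with `np.linalg.solve` rather than by forming an explicit inverse, which is the usual numerically safer choice.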

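Define $\mathbf{W}_{jpq}$ as an $S$-dimensional vector of the elements $w_{jspq}$ and $\mathbf{W}_j = \{\mathbf{W}_{jpq}\} \otimes \mathbf{I}_{PQ}$. Note that $\mathbf{W}_j$ is a $PQ \times PQS$ matrix. Then the GLS estimate is obtained as

$$\hat{\boldsymbol{\gamma}} = \Big( \sum_j \mathbf{W}_j^{T} \mathbf{T}^{-1} \mathbf{W}_j \Big)^{-1} \sum_j \mathbf{W}_j^{T} \mathbf{T}^{-1} \boldsymbol{\beta}_j \tag{5}$$

and the covariance matrix of the residuals estimated as

$$\hat{\mathbf{T}} = \frac{1}{J} \sum_j (\boldsymbol{\beta}_j - \mathbf{W}_j \boldsymbol{\gamma})(\boldsymbol{\beta}_j - \mathbf{W}_j \boldsymbol{\gamma})^{T}. \tag{6}$$

Define $\mathbf{X}_j^{*} = \{\mathbf{X}_j\} \otimes \mathbf{I}_Q$, with $\mathbf{X}_j$ a matrix of the elements $\{x_{ijp}\}$. Note that $\mathbf{X}_j^{*}$ is an $N_j Q \times PQ$ matrix. With respect to $\boldsymbol{\beta}_j$ we then have

$$\boldsymbol{\beta}_j = \big( \mathbf{X}_j^{*T} \mathbf{X}_j^{*} + \mathbf{T}^{-1} \big)^{-1} \big[ \mathbf{X}_j^{*T} \boldsymbol{\eta}_j + \mathbf{T}^{-1} \mathbf{W}_j \boldsymbol{\gamma} \big], \tag{7}$$

where $\boldsymbol{\eta}_j = \{\boldsymbol{\eta}_{jq}\} \otimes \mathbf{1}_Q$, and $\boldsymbol{\eta}_{jq}$ has elements

$$\eta_{ijq} = \int \cdots \int \theta_{ijq}\, p(\boldsymbol{\theta}_{ij} \mid \mathbf{u}_{ij}, \mathbf{a}_{ij}, \mathbf{b}_{ij}, \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})\, d\boldsymbol{\theta}_{ij}, \quad i = 1, \ldots, N_j. \tag{8}$$

For $\boldsymbol{\Sigma}$, we have the estimate

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{ij} \int \cdots \int (\boldsymbol{\theta}_{ij} - \mathbf{B}_j \mathbf{X}_{ij})(\boldsymbol{\theta}_{ij} - \mathbf{B}_j \mathbf{X}_{ij})^{T}\, p(\boldsymbol{\theta}_{ij} \mid \mathbf{u}_{ij}, \mathbf{a}_{ij}, \mathbf{b}_{ij}, \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})\, d\boldsymbol{\theta}_{ij}, \tag{9}$$

with $\boldsymbol{\theta}_{ij}$ a $Q$-vector of the elements $\theta_{ijq}$, $\mathbf{X}_{ij}$ a $P$-dimensional vector of the elements $x_{ijp}$, and $\mathbf{B}_j$ a $Q \times P$ matrix of the elements $\beta_{jpq}$. This process is iterated until convergence is achieved.

The multiple integrals that appear above can be evaluated using adaptive Gauss-Hermite quadrature (Schilling & Bock, 2005). A critical point related to using Gauss-Hermite quadrature is the dimensionality of the latent space (i.e., the number of latent variables that can be analyzed simultaneously). Wood et al. (2002) indicate that the maximum number of dimensions is 10 with adaptive quadrature, 5 with nonadaptive quadrature, and 15 with Monte Carlo integration.

MML2

The second approach is a generalization of a method by Rubin and Thomas (2001). Assume that $y_{ijq}$ are estimated latent variables of persons $i = 1, \ldots, N_j$ within Level 2 units $j = 1, \ldots, J$, on scales $q = 1, \ldots, Q$. The estimates have unknown estimation errors $e_{ijq}$, with estimated variance $d_{ijq}$; that is, $y_{ijq} = \theta_{ijq} + e_{ijq}$, where it is assumed that $\mathbf{e}_{ij}$ has a $Q$-variate normal distribution with expectation zero and

$$\mathrm{Cov}(\mathbf{e}_{ij}, \mathbf{e}_{ij}) = \mathbf{D}_{ij} = \mathrm{diag}(d_{ij1}, \ldots, d_{ijq}, \ldots, d_{ijQ}).$$

Thus,

$$\mathbf{e}_{ij} \sim N(\mathbf{0}, \mathbf{D}_{ij}).$$

In the application below, maximum likelihood estimates and the squared estimated standard errors are used for $y_{ijq}$ and $d_{ijq}$, respectively.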
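For a single scale ($Q = 1$), the normal error model above combined with a normal regression distribution for $\theta$ gives a posterior mean of $\theta$ in closed form, which makes it a convenient check on quadrature code. The sketch below is illustrative only: it uses plain (non-adaptive) Gauss-Hermite quadrature from NumPy, and the function names and the values of `y`, `d`, `mu`, and `sigma2` are assumptions, with `mu` standing for the regression prediction of $\theta$ and `sigma2` for its residual variance.

```python
import numpy as np

def posterior_mean_quadrature(y, d, mu, sigma2, n_points=40):
    """Posterior mean of theta given y = theta + e, e ~ N(0, d),
    with theta ~ N(mu, sigma2), by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_points)
    # Change of variables so the N(mu, sigma2) density matches the weight exp(-x^2).
    theta = mu + np.sqrt(2.0 * sigma2) * x
    # Likelihood of y at each node; constants cancel in the ratio below.
    lik = np.exp(-0.5 * (y - theta) ** 2 / d)
    return np.sum(w * theta * lik) / np.sum(w * lik)

def posterior_mean_closed_form(y, d, mu, sigma2):
    """Precision-weighted average of the estimate y and the prior mean mu."""
    precision = 1.0 / d + 1.0 / sigma2
    return (y / d + mu / sigma2) / precision

# Hypothetical values: estimate 1.2 with error variance 1.0,
# regression prediction 0.3 with residual variance 0.5.
eta_quad = posterior_mean_quadrature(1.2, 1.0, 0.3, 0.5)
eta_exact = posterior_mean_closed_form(1.2, 1.0, 0.3, 0.5)
print(eta_quad, eta_exact)   # closed form: (1.2/1.0 + 0.3/0.5) / 3
```

A fixed grid centered on the prior becomes less accurate when the error variance `d` is small relative to `sigma2`, because the posterior then concentrates away from the nodes; recentering the grid on the posterior is the motivation for adaptive quadrature.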

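Again, we use the EM algorithm to estimate $\boldsymbol{\gamma}$, $\boldsymbol{\Sigma}$, and $\mathbf{T}$ given $\mathbf{Y}$, $\mathbf{X}$, $\mathbf{W}$, and $\mathbf{D}$. Analogous to MML1, consider $\theta_{ijq}$ as missing data. The complete-data specification for a person $ij$ is then given by

$$p(\mathbf{y}_{ij}, \boldsymbol{\theta}_{ij}, \boldsymbol{\beta}_j \mid \mathbf{D}_{ij}, \mathbf{x}_{ij}, \mathbf{W}_j, \boldsymbol{\gamma}, \boldsymbol{\Sigma}, \mathbf{T}) \propto p(\mathbf{y}_{ij} \mid \boldsymbol{\theta}_{ij}, \mathbf{D}_{ij})\, p(\boldsymbol{\theta}_{ij} \mid \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})\, p(\boldsymbol{\beta}_j \mid \mathbf{W}_j, \boldsymbol{\gamma}, \mathbf{T}),$$

where $\boldsymbol{\gamma}$ is a $PQS$-dimensional vector of the elements $\gamma_{pqs}$, and $\boldsymbol{\beta}_j$ is a $PQ$-dimensional vector of the elements $\beta_{jpq}$. The E-step consists of estimating the posterior distribution $p(\boldsymbol{\theta}_{ij} \mid \mathbf{y}_{ij}, \mathbf{D}_{ij}, \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij}, \mathbf{W}_j, \boldsymbol{\gamma}, \mathbf{T})$, and the M-step consists of a maximization of $E(\log L_{\theta, y}(\boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\Sigma}, \mathbf{T}) \mid \mathbf{Y}, \mathbf{D}, \mathbf{X}, \mathbf{W}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\Sigma}, \mathbf{T})$ with respect to $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$, $\boldsymbol{\Sigma}$, and $\mathbf{T}$. Note that this is very similar to MML1.

The E-Step

Compute

$$
\begin{aligned}
p(\boldsymbol{\theta}_{ij} \mid \mathbf{y}_{ij}, \mathbf{D}_{ij}, \boldsymbol{\beta}_j, \boldsymbol{\gamma}, \boldsymbol{\Sigma}, \mathbf{T})
&= \frac{p(\mathbf{y}_{ij} \mid \boldsymbol{\theta}_{ij}, \mathbf{D}_{ij})\, p(\boldsymbol{\theta}_{ij} \mid \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})\, p(\boldsymbol{\beta}_j \mid \mathbf{W}_j, \boldsymbol{\gamma}, \mathbf{T})}{\left[ \int \cdots \int p(\mathbf{y}_{ij} \mid \boldsymbol{\theta}_{ij}, \mathbf{D}_{ij})\, p(\boldsymbol{\theta}_{ij} \mid \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})\, d\boldsymbol{\theta}_{ij} \right] p(\boldsymbol{\beta}_j \mid \mathbf{W}_j, \boldsymbol{\gamma}, \mathbf{T})} \\
&= \frac{p(\mathbf{y}_{ij} \mid \boldsymbol{\theta}_{ij}, \mathbf{D}_{ij})\, p(\boldsymbol{\theta}_{ij} \mid \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})}{\int \cdots \int p(\mathbf{y}_{ij} \mid \boldsymbol{\theta}_{ij}, \mathbf{D}_{ij})\, p(\boldsymbol{\theta}_{ij} \mid \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})\, d\boldsymbol{\theta}_{ij}}.
\end{aligned}
$$

The M-Step

We have to maximize with respect to $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$, $\boldsymbol{\Sigma}$, and $\mathbf{T}$. The equations for $\boldsymbol{\gamma}$, $\mathbf{T}$, and $\boldsymbol{\beta}_j$ are completely analogous to the equations in MML1 given by (5), (6), and (7). Besides, (8) becomes

$$\eta_{ijq} = \int \cdots \int \theta_{ijq}\, p(\boldsymbol{\theta}_{ij} \mid \mathbf{y}_{ij}, \mathbf{D}_{ij}, \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})\, d\boldsymbol{\theta}_{ij},$$

for $i = 1, \ldots, N_j$. For the estimation of $\boldsymbol{\Sigma}$, we now have the equation

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{ij} \int \cdots \int (\boldsymbol{\theta}_{ij} - \mathbf{B}_j \mathbf{X}_{ij})(\boldsymbol{\theta}_{ij} - \mathbf{B}_j \mathbf{X}_{ij})^{T}\, p(\boldsymbol{\theta}_{ij} \mid \mathbf{y}_{ij}, \mathbf{D}_{ij}, \boldsymbol{\beta}_j, \boldsymbol{\Sigma}, \mathbf{x}_{ij})\, d\boldsymbol{\theta}_{ij},$$

with the same definitions as above.

An Example

The estimation procedures will be compared to each other and to a fully Bayesian procedure using an application reported by Shalabi (2002). The data were a cluster sample of 3,384 seventh grade students in 119 schools. At student level the variables were Gender (0 = male, 1 = female), SES (with two indicators: the father's and mother's education; score range, 0–8), and IQ (score range, 0–80). School-level variables were Leadership (measured by a scale consisting of 25 five-point Likert items, administered to the school teachers), School Climate (measured by a scale consisting of 23 five-point Likert items), and Mean-IQ (the IQ scores aggregated at school level).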
The items assessing Leadership and School Climate were scored dichotomously. The dependent variable was a mathematics achievement test consisting of 50 multiple-choice items. The 2PL model was used to model the responses to the Leadership and School Climate questionnaire and the mathematics test. The parameters were estimated with the Gibbs sampler using the method by Fox and Glas (2001). For a complete description of all analyses, see Shalabi (2002); here only the estimates of the final model are given as an example. The model is given by

$$\theta_{ij} = \beta_{0j} + \beta_1 \mathrm{SES}_{ij} + \beta_2 \mathrm{Gender}_{ij} + \beta_3 \mathrm{IQ}_{ij} + e_{ij}$$

and

$$\beta_{0j} = \gamma_{00} + \gamma_{01} \text{Mean-IQ}_j + \gamma_{02} \mathrm{Leadership}_j + \gamma_{03} \mathrm{Climate}_j + u_{0j}.$$

The results reported by Shalabi (2002) are given in Table 1. The estimates of the MLIRT model are compared with a traditional multilevel analysis where all variables were manifest. The observed Mathematics, Leadership, and School Climate scores were transformed in such a way that their scale was comparable to the scale used in the MLIRT model. Further, the parameters of the multilevel model were also estimated with a Bayesian approach using the Gibbs sampler. The columns labeled CI give the 90% confidence intervals of the unknown parameters; they were derived from the posterior standard deviation. It can be seen that the magnitudes of the fixed effects in the MLIRT model were larger than the analogous estimates in the multilevel model. This finding is in line with other findings (Fox & Glas, 2001; Shalabi, 2002), which indicate that the MLIRT model has more power to detect effects in hierarchical data when some variables are measured with error.

TABLE 1
Bayesian estimates of multilevel IRT model and multilevel model on observed scores

                             MLIRT Estimates       ML Estimates
Variable                     Estimate    CI        Estimate    CI
$\gamma_{00}$
$\beta_1$ (SES)
$\beta_2$ (Gender)
$\beta_3$ (IQ)
$\gamma_{01}$ (Mean-IQ)
$\gamma_{02}$ (Leadership)
$\gamma_{03}$ (Climate)
Variance components
$\tau^2$
$\sigma^2$

Note. CI = confidence interval; SES = socioeconomic status. The numeric entries of this table are missing from this copy.

The estimates and standard errors obtained by using MML1 and MML2 are shown in Table 2. By comparing the entries in Table 1 and Table 2, it can be seen that the estimates are all very close. This also holds when the credible regions obtained by the fully Bayesian method are compared with the confidence intervals obtained by maximum likelihood. The latter regions are slightly smaller, but this has no effect on the outcomes of tests of the hypotheses that certain effects are equal to zero.

TABLE 2
Marginal maximum likelihood estimates of multilevel IRT model using methods MML1 and MML2, respectively

                             MML1 Estimates        MML2 Estimates
Variable                     Estimate    CI        Estimate    CI
$\gamma_{00}$
$\beta_1$ (SES)
$\beta_2$ (Gender)
$\beta_3$ (IQ)
$\gamma_{01}$ (Mean-IQ)
$\gamma_{02}$ (Leadership)
$\gamma_{03}$ (Climate)
Variance components
$\tau^2$
$\sigma^2$

Note. CI = confidence interval; SES = socioeconomic status. The numeric entries of this table are missing from this copy.

An advantage of the MML methods is their relative speed. The fully Bayesian analysis took about 3 hours, while the MML analyses took about 45 minutes. The fact that the latter is also not very fast has to do with the fact that MML involved integration over a three-dimensional latent space with 10 quadrature points for every dimension. It must also be mentioned that the programming was not optimized with respect to speed. Further, the ever-increasing speed of computers may also support the application of these models in complex situations.

References

Adams, R. J., Wilson, M. R., & Wu, M. (1997). Multilevel item response theory models: An approach to errors in variables of regression. Journal of Educational and Behavioral Statistics, 22.

Birnbaum, A. (1968). Some latent trait models. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM-algorithm. Psychometrika, 46.

Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models. Newbury Park: Sage.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39.

Efron, B. (1977). Discussion on maximum likelihood from incomplete data via the EM algorithm (by A. P. Dempster, N. M. Laird, and D. B. Rubin). Journal of the Royal Statistical Society, Series B, 39.

Fox, J.-P., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66.

Glas, C. A. W. (1992). A Rasch model with a multivariate distribution of ability. In M. Wilson (Ed.), Objective measurement: Theory into practice (Vol. 1). Norwood, NJ: Ablex.

Glas, C. A. W. (1998). Detection of differential item functioning using Lagrange multiplier tests. Statistica Sinica, 8.

Goldstein, H. (1986). Multilevel mixed linear models analysis using iterative generalized least squares. Biometrika, 73.

Longford, N. T. (1993). Random coefficient models. Oxford Statistical Science Series (Vol. 11). Oxford: Clarendon.

Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B, 44.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47.

Mislevy, R. J., & Bock, R. D. (1989). A hierarchical item-response model for educational testing. In R. D. Bock (Ed.), Multilevel analysis of educational data. San Diego: Academic Press.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.

Rubin, D. B., & Thomas, N. (2001). Using parameter expansion to improve the performance of the EM algorithm for multidimensional IRT population-survey models. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory. New York: Springer.

Schilling, S., & Bock, R. D. (2005). High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature. Psychometrika, 70.

Shalabi, F. (2002). Effective schooling in the West Bank. Unpublished doctoral thesis, University of Twente, the Netherlands.

Wood, R., Wilson, D. T., Gibbons, R. D., Schilling, S. G., Muraki, E., & Bock, R. D. (2002). TESTFACT: Test scoring, item statistics, and item factor analysis. Chicago, IL: Scientific Software International, Inc.

van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31.

van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72.

van der Linden, W. J. (2008). Using response times for item selection in adaptive testing. Journal of Educational and Behavioral Statistics, 33, 5–20.
