Application of the Stochastic EM Method to Latent Regression Models
Research Report RR-04-34

Application of the Stochastic EM Method to Latent Regression Models

Matthias von Davier and Sandip Sinharay

ETS, Princeton, NJ

Research & Development, August 2004

ETS Research Reports provide preliminary and limited dissemination of ETS research prior to publication. To obtain a PDF or a print copy of a report, please visit the ETS Web site.
Abstract

The reporting methods used in large scale assessments such as the National Assessment of Educational Progress (NAEP) rely on a latent regression model. The first component of the model consists of a p-scale IRT measurement model that defines the response probabilities on a set of cognitive items in p scales depending on a p-dimensional latent trait variable θ = (θ_1, ..., θ_p). In the second component, the conditional distribution of this latent trait variable θ is modeled by a multivariate, multiple regression on a set of predictor variables, which are usually based on student, school, and teacher variables in assessments such as NAEP. In order to fit the latent regression model using the maximum (marginal) likelihood estimation technique, multivariate integrals have to be evaluated. In the computer program MGROUP used by ETS for fitting the latent regression model to data from NAEP and other sources, the integration is currently done either by numerical quadrature (for problems up to two dimensions) or by an approximation of the integral. CGROUP, the current operational version of the MGROUP program used in NAEP and other assessments since 1993, is based on a Laplace approximation that may not provide fully satisfactory results, especially if the number of items per scale is small. This paper examines the application of stochastic expectation-maximization (EM) methods (where an integral is approximated by an average over a random sample) to NAEP-like settings. We present a comparison of CGROUP with a promising implementation of the stochastic EM algorithm that utilizes importance sampling. Simulation studies and real data analysis show that the stochastic EM method provides a viable alternative to CGROUP for fitting multivariate latent regression models.

Key words: CGROUP, NAEP, importance sampling, stochastic integration
Acknowledgements

The authors thank John Mazzeo, Shelby Haberman, Alina von Davier, Neal Thomas, Andreas Oranje, Ying Jin, and Matthew Johnson for useful advice, Steve Isham for help with the data sets used in the analysis, and Kim Fryer for help with proofreading.
1 Introduction

The National Assessment of Educational Progress (NAEP), the only regularly administered and congressionally mandated national assessment program (see, e.g., Beaton & Zwick, 1992), is an ongoing survey of the academic achievement of school students in the United States in a number of subject areas such as reading, writing, and mathematics. It is administered by the National Center for Education Statistics (NCES), a part of the U.S. Department of Education, to a selected sample of students in Grades 4, 8, and 12. A document called the Nation's Report Card reports the results of NAEP for a number of academic subjects on a regular basis, with comparisons to the previous assessments in each subject area. Academic achievement is described as average student proficiency for all students in the U.S. and as the percentage of students attaining fixed levels of proficiency in different subjects (the levels being defined by the National Assessment Governing Board as what students should know at each grade level). In addition to producing these numbers for the nation as a whole, NAEP reports the same results for different subpopulations (based on gender, race, school type, etc.) of the student population. NAEP is prohibited by law from reporting results for individual students, schools, or school districts and is designed to obtain optimal estimates of subpopulation characteristics rather than those of individual performance. To assess national performance in a valid way, NAEP must sample a wide and diverse body of student knowledge. To avoid the burden involved with presenting each item to every examinee, NAEP selects students randomly from designated grade and age populations (first, a sample of schools is selected according to a detailed stratified sampling plan, as mentioned in, e.g., Beaton & Zwick, 1992, and then students are sampled within the schools). NAEP then administers one of many possible booklets of items to each student.
This process is sometimes referred to as matrix sampling of items. For example, the 2000 NAEP mathematics assessment at grade 4 contained 173 items split across 26 booklets. Each item was developed to measure one of five subscales: (a) Number and Operations, (b) Measurement, (c) Geometry, (d) Data Analysis, and (e) Algebra. An item can be multiple-choice or constructed-response. Background (demographic) information is collected on the students through questionnaires that are
filled out by students, teachers, and school administrators. For example, the questionnaires for that assessment collected information on 381 background variables for each student. The above description clearly shows that NAEP's design and implementation are fundamentally different from those of a typical large scale testing program. NAEP reports were originally envisaged as simple lists of percents correct on individual survey items (Mislevy, Johnson, & Muraki, 1992), in the population as a whole and in subpopulations of particular interest. However, it was realized later that this approach has severe limitations (e.g., comparison is limited to groups of items common to student subpopulations), and major features of the detailed results from hundreds of items and hundreds of background variables could not be effectively communicated without some kind of statistical modeling. As a result, starting in 1984, NAEP reporting methods used a statistical model consisting of two components: (a) an item response theory (IRT) model, and (b) a linear regression model (see, e.g., Beaton, 1987; Mislevy et al., 1992). Other large scale educational assessments such as the International Adult Literacy Study (IALS; Kirsch, 2001), the Trends in Mathematics and Science Study (TIMSS; Martin & Kelly, 1996), and the Progress in International Reading Literacy Study (PIRLS; Mullis, Martin, Gonzalez, & Kennedy, 2003) also adopted essentially the same model. This model is referred to as a conditioning model, a multilevel IRT model, or a latent regression model. An algorithm for estimating the parameters of this model is implemented in the MGROUP set of programs, an ETS product. The first component of the model (the IRT part) defines the responses of examinees to a set of cognitive items as dependent on a p-dimensional latent trait/proficiency vector θ = (θ_1, ..., θ_p).
In the second component (the linear regression part), the distribution of the proficiency variable θ is modeled by a multivariate, multiple regression on a set of background/predictor variables, which contain information on the respective students, schools, teachers, etc. In order to compute maximum (marginal) likelihood estimates of the parameters of this latent regression model, the proficiency θ (usually multivariate) has to be integrated out; this makes estimation for the model problematic. Mislevy (1984, 1985) shows that the maximum likelihood estimates can be obtained using an expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). The algorithm requires the values of the
posterior means and the posterior variances for the examinee proficiency parameters. For problems up to two dimensions, the integration is computed using numerical quadrature by the BGROUP version (Beaton, 1987) of the MGROUP program in operational settings in NAEP. For higher dimensions, no numerical integration routine is available operationally (although it may now be possible to perform numerical integration for higher dimensions; work on that is in progress), and an approximation of the integral is used. The CGROUP version of MGROUP, the current operational procedure used in NAEP and other assessments since 1993, is based on the Laplace approximation (Kass & Steffey, 1989), which ignores the higher order derivatives of the examinee posterior distribution and may not provide accurate results, especially in higher dimensions (e.g., a graphical plot for a data example in Thomas, 1993, shows that CGROUP overestimates the high examinee posterior variances). A number of extensions to the CGROUP version of MGROUP, as well as alternative estimation methods, have been suggested. The MCMC estimation approach of Johnson and Jenkins (in press) and Johnson (2002) focuses on implementing a fully Bayesian method for joint estimation of IRT and regression parameters. The direct estimation method for normal populations (Cohen & Jiang, 1999) tries to replace the multiple regression by a model that is essentially equivalent to a multigroup IRT model with some additional restrictions on item parameters and group specific variances. The YGROUP version (von Davier & Yu, 2003) of MGROUP implements seemingly unrelated regressions (SUR; Zellner, 1962) and offers generalized least squares estimation of the latent regression model (von Davier & Yon, 2004). None of these alternatives has been entirely satisfactory.
The MCMC estimation approach could not be implemented for the operational model with multidimensional proficiencies and the full set of background variables because of the time-consuming nature of MCMC estimation. The direct estimation approach as proposed by Cohen and Jiang (1999) leads to biased estimates even in very small data examples with nonnormal marginal ability distributions (von Davier, 2003). The SUR approach to estimating multivariate regressions is identical to generalized least squares (GLS) if all equations have the same regressors and observed dependent variables (Greene, 2002), a condition that cannot be expected to hold in latent regressions. Therefore, there is scope for
further research in this area.

Stochastic EM methods (SEM; Broniatowski, Celeux, & Diebolt, 1983; Celeux & Diebolt, 1985) have been suggested for use in EM algorithms where the E-step is hard to compute or approximate analytically or numerically. In an application of a stochastic EM algorithm, the E-step is handled by simulation. Depending on the nature of the simulation used in the E-step, a stochastic EM method can be a Monte Carlo EM (e.g., Wei & Tanner, 1990), a Markov chain Monte Carlo EM (e.g., McCulloch, 1994), a rejection sampling EM (e.g., Booth & Hobert, 1999), an importance sampling EM (e.g., Booth & Hobert, 1999), etc. Important applications of the algorithm in psychometrics include Fox (2003); Lee, Song, and Lee (2003); Clarkson and Gonzalez (2001); and Meng and Schilling (1996), each of which employs a Markov chain Monte Carlo algorithm (e.g., Gilks, Richardson, & Spiegelhalter, 1996) to simulate in the E-step. The importance sampling EM (e.g., Booth & Hobert, 1999) is a stochastic EM method where each integral in the E-step is approximated by an average over a random sample. This paper examines the application of the importance sampling EM method to fit the latent regression model in NAEP-like settings. We also present a comparison of the results from the importance sampling EM algorithm with those from the CGROUP version of MGROUP, the technique currently used in NAEP. For a low dimensional real data example, the results from the suggested method are also compared to those from the BGROUP version of MGROUP, which performs exact numerical integration and is the gold standard method in such settings. Simulation studies and real data analysis show that the stochastic EM method provides a viable alternative to CGROUP for fitting latent regression models.

2 NAEP Model, Estimation, and the Current MGROUP Method

This section first states the two component statistical model used in NAEP. A brief description of the EM algorithm, which will be helpful later, follows.
The next subsection describes the current NAEP estimation process and the MGROUP program used at ETS.
2.0.1 The Latent Regression Model

NAEP and other similar large scale assessments implement a latent regression model utilizing an IRT measurement model. Assume that the unique p-dimensional latent proficiency vector for examinee i is θ_i = (θ_i1, θ_i2, ..., θ_ip). In NAEP, p could be 2, 3, or 5. Let us denote the response vector to the test items for examinee i as y_i = (y_i1, y_i2, ..., y_ip), where y_ik, a vector of responses, contributes information about θ_ik. The likelihood for an examinee is given by

f(y_i | θ_i) = ∏_{q=1}^{p} f_1(y_iq | θ_iq) ≡ L(θ_i ; y_i).  (1)

The terms f_1(y_iq | θ_iq) above follow a univariate IRT model, usually with three-parameter logistic (3PL) or generalized partial-credit model (GPCM) likelihood terms. For reasons to be discussed later, the dependence of (1) on the item parameters is suppressed. Suppose x_i = (x_i1, x_i2, ..., x_im) are m fully measured demographic and educational characteristics for the examinee. Conditional on x_i, the examinee proficiency vector θ_i is assumed to follow a multivariate normal prior distribution, that is, θ_i | x_i ~ N(Γ'x_i, Σ). The mean parameter matrix Γ and the nonnegative definite variance matrix Σ are assumed to be the same for all examinee groups. Under this setup, L(Γ, Σ | Y, X), the (marginal) likelihood function for (Γ, Σ) based on the data (X, Y), is given by

L(Γ, Σ | Y, X) = ∏_{i=1}^{n} ∫ f_1(y_i1 | θ_i1) ⋯ f_1(y_ip | θ_ip) φ(θ_i | Γ'x_i, Σ) dθ_i,  (2)

where n is the number of examinees, and φ(· | ·, ·) is the multivariate normal density function.

2.0.2 EM Algorithm

The EM algorithm (e.g., Dempster et al., 1977) is commonly used to obtain maximum likelihood estimates or posterior modes in problems with missing data. In fact, the algorithm is applicable in a broad range of situations where probability models can be reexpressed as ones on augmented parameter spaces and the added parameters can be thought of as missing data. The EM algorithm is relevant for mixed models because
the random effects may be viewed as missing data and the algorithm may be used to find maximum likelihood estimates for these models. Suppose that, given an unknown parameter vector ω, the data y follow a distribution f(y; ω) and that y = (y_obs, y_mis), where y_obs is the observed data and y_mis is the missing data. Suppose the objective is to obtain the maximum likelihood estimate of ω, that is, the value of ω that maximizes the observed data likelihood f(y_obs | ω) = ∫ f(y_obs, y_mis | ω) dy_mis. The EM algorithm goes through a succession of E-steps and M-steps starting from an initial value ω^(0) of the parameter vector ω. In the (t + 1)-th E-step, one computes

Q(ω | ω^(t)) = E[ log{f(y_obs, y_mis | ω)} | y_obs, ω^(t) ],  (3)

the expected value of the complete data loglikelihood, where the expectation is with respect to the distribution of y_mis given y_obs and ω^(t). In the corresponding M-step, one maximizes Q(ω | ω^(t)), computed in the preceding E-step, with respect to ω to get ω^(t+1). The algorithm is said to have converged when two successive iterations give very similar values of the parameter vector. It is also important to monitor the value of the loglikelihood f(y_obs | ω). The EM algorithm is guaranteed to converge only to a local maximum, so it is customary to start the iterations at many points in the parameter space and then to compare the values of the likelihood at all of the local maxima. The method may be very slow to converge if the proportion of missing data is high. For simple problems, one may be able to do the averaging in (3) analytically. However, in many practical problems (e.g., in NAEP estimation), the expectation cannot be computed analytically; then one may need to use an approximation technique or a simulation technique to compute the expectation.

2.0.3 NAEP Estimation Process and the MGROUP Program

NAEP uses a three-stage estimation process for fitting the above mentioned latent regression model and making inferences. The first stage, scaling, fits a simple IRT
model (consisting of 3PL and GPCM terms) of the form in (1) to the examinee response data and estimates the item parameters. The prior distribution used in this step is not θ_i | x_i ~ N(Γ'x_i, Σ) as described above, but rather θ_i ~ N(0, I); that is, the subscales are assumed to be independent a priori. The second stage, conditioning, assumes that the item parameters are fixed at the estimates found in scaling and fits the model in (2) to the data; that is, it estimates Γ and Σ as a first part. In the second part of the conditioning step, plausible values for all examinees are obtained using the parameter estimates obtained in scaling and the first part of conditioning; the plausible values are used to estimate examinee subgroup averages. The third stage of the NAEP estimation process, called variance estimation, estimates the variances corresponding to the examinee subgroup averages using a jackknife approach (see, e.g., Johnson & Jenkins, in press). Our research will focus on the conditioning step and assume that the scaling has already been done (i.e., the item parameters are fixed); this is the reason we suppress the dependence of (1) on the item parameters. Because we will be concerned with the conditioning step, the remaining part of the section provides a more detailed discussion of it. The first objective of this step is to estimate Γ and Σ from the data. If the θ_i's were known, the maximum likelihood estimators of Γ and Σ would be

Γ̂ = (X'X)⁻¹ X' (θ_1, θ_2, ..., θ_n)',  (4)

Σ̂ = (1/n) Σ_i (θ_i − Γ̂'x_i)(θ_i − Γ̂'x_i)'.  (5)

However, the θ_i's are actually unknown. Mislevy (1984, 1985) shows that the maximum likelihood estimates of Γ and Σ under unknown θ_i's can be obtained using an EM algorithm (Dempster et al., 1977). The EM algorithm iterates through a number of expectation steps (E-steps) and maximization steps (M-steps). The expression for (Γ_{t+1}, Σ_{t+1}), the updated
value of the parameters in the t-th M-step, is obtained as

Γ_{t+1} = (X'X)⁻¹ X' (θ̄_{1t}, θ̄_{2t}, ..., θ̄_{nt})',  (6)

Σ_{t+1} = (1/n) [ Σ_i Var(θ_i | X, Y, Γ_t, Σ_t) + Σ_i (θ̄_{it} − Γ'_{t+1} x_i)(θ̄_{it} − Γ'_{t+1} x_i)' ],  (7)

where θ̄_{it} = E(θ_i | X, Y, Γ_t, Σ_t) is the posterior mean for examinee i given the preliminary parameter estimates of iteration t. The process is repeated until convergence of the estimates Γ and Σ. Formulae (6) and (7) require the values of the posterior means E(θ_i | X, Y, Γ_t, Σ_t) and the posterior variances Var(θ_i | X, Y, Γ_t, Σ_t) for the examinees. Correspondingly, the t-th E-step computes these two required quantities for all the examinees. The MGROUP set of programs at ETS performs the EM algorithm mentioned above. The MGROUP program consists of two primary controlling routines called PHASE1 and PHASE2. The former does some preliminary processing while the latter directs the EM iterations. There are different versions of the MGROUP program depending on the method used to perform the E-step in PHASE2: BGROUP using numerical quadrature, NGROUP (Mislevy, 1985) using Bayesian normal theory, CGROUP (Thomas, 1993) using Laplace approximations, YGROUP (von Davier & Yu, 2003) using SUR, and so on. The BGROUP version of MGROUP, applied when the dimension of θ_i is less than or equal to two, applies numerical quadrature to approximate the integral. When the dimension of θ_i is larger than two, NGROUP, CGROUP, or YGROUP may be used. Note that NGROUP and YGROUP are not used for operational purposes for the following reasons: NGROUP assumes normality of both the likelihood and the prior and may produce biased results if these assumptions are not met (Thomas, 1993). YGROUP performs the estimation independently by implementing seemingly unrelated regressions (Zellner, 1962), which is useful for generating starting values, but does not capitalize on the covariances between the p components of the latent trait.
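As a concrete illustration, the M-step updates (6) and (7) can be sketched in a few lines of code. This is a minimal sketch rather than the MGROUP implementation; the function name, array names, and shapes are assumptions made only for this example.

```python
import numpy as np

def m_step(X, theta_bar, post_var):
    """One M-step of the latent regression EM, following (6) and (7).

    X         : (n, m) matrix of background variables
    theta_bar : (n, p) posterior means E(theta_i | X, Y, Gamma_t, Sigma_t)
    post_var  : (n, p, p) posterior covariances Var(theta_i | X, Y, Gamma_t, Sigma_t)
    """
    n = X.shape[0]
    # (6): Gamma_{t+1} = (X'X)^{-1} X' (posterior means stacked as rows)
    Gamma = np.linalg.solve(X.T @ X, X.T @ theta_bar)       # (m, p)
    # (7): average of the posterior variances plus the outer products
    # of the regression residuals evaluated at Gamma_{t+1}
    resid = theta_bar - X @ Gamma                           # (n, p)
    Sigma = (post_var.sum(axis=0) + resid.T @ resid) / n    # (p, p)
    return Gamma, Sigma
```

As a sanity check, if the posterior means reproduce the regression exactly and the posterior variances are zero, the update recovers Γ exactly and yields a zero Σ, as (6) and (7) suggest.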
Ignoring these covariances may be inappropriate, for example, if the latent trait θ_i is measured with very different precisions across components θ_iq for some or all subjects.

The following subsection provides details about CGROUP, the approach currently used operationally in several large scale assessments including NAEP. This approach is based on the approximation of the posterior moments of a multivariate distribution using the Laplace approximation.

E-step. The CGROUP version of MGROUP uses the Laplace approximation of the posterior mean and variance, that is, uses

E(θ_j | X, Y, Γ_t, Σ_t) ≈ θ̂_{j,mode} − (1/2) Σ_{r=1}^{p} G_{jr} G_{rr} ĥ_r^(3),  j = 1, 2, ..., p,  (8)

Cov(θ_j, θ_k | X, Y, Γ_t, Σ_t) ≈ G_{jk} − (1/2) Σ_{r=1}^{p} G_{jr} G_{kr} G_{rr} ĥ_r^(4)
  + Σ_{r=1}^{p} Σ_{s=1}^{p} {[1 − (1/2) I(r = s)] ĥ_r^(3) ĥ_s^(3) G_{rs} (G_{js} G_{ks} G_{rr} + G_{js} G_{kr} G_{rs} + G_{jr} G_{ks} G_{rs} + G_{jr} G_{kr} G_{ss})},  j, k = 1, 2, ..., p,  (9)

where

θ̂_{j,mode} = the j-th component of the posterior mode of θ,
h = log[f(y | θ) φ(θ | Γ'_t x, Σ_t)],
G_{jk} = the (j, k)-th element of ( ∂²h/∂θ_r∂θ_s |_{θ̂_mode} )⁻¹,  (10)
ĥ_r^(n) = the n-th pure partial derivative of h with respect to θ_r, evaluated at θ̂_mode,

and I(r = s) is 1 if r = s and 0 otherwise. Details about formulae (8) and (9) can be found in Thomas (1993). The Laplace method does not provide an unbiased estimate of the quantity it is approximating and may provide inaccurate results if the higher order derivatives (that the Laplace method assumes to be equal to zero) are not negligible. The error of approximation for each component of the mean and covariance of θ_i is of order O(1/k²) (e.g., Kass & Steffey, 1989), where k is the number of items measuring the skill corresponding to the component. Because the number of items given to each examinee in large scale assessments such as
NAEP is not too large (making k rather small), the errors in the Laplace approximation may become nonnegligible, especially for high-dimensional θ_i's. Further, if the posterior distribution of the θ_i's is multimodal (which is not impossible, especially for a small number of items), the method can perform poorly. Therefore, the CGROUP version of MGROUP is not entirely satisfactory. Figure 1 in Thomas (1993), where the posterior variance estimates of 500 randomly selected examinees using the Laplace method and exact numerical integration for two-dimensional θ_i are plotted, shows that the Laplace method provides inflated variance estimates for examinees with large posterior variance (see also Section 6 in this report). The departure may be more severe for θ_i's in higher dimensions.

3 The Suggested Stochastic EM Method

Before going into the details of the suggested technique, we provide a brief description of the importance sampling method (e.g., Geweke, 1989).

3.1 Importance Sampling

The Radon-Nikodym theorem (e.g., Bauer, 1972) forms the basis for importance sampling. This theorem states that a measure M over Ω can be expressed in terms of another measure N over Ω as

M(A) = ∫_A f dN, that is, ∫_A m(ω) dω = ∫_A n(ω) f(ω) dω,

provided N(A) = 0 ⇒ M(A) = 0 for all A. The function f is referred to as the Radon-Nikodym derivative and is sometimes written as f = dM/dN. In the case that m and n are probability densities over Ω = R^p, the theorem implies

∫_Ω ω f(ω) n(ω) dω = ∫_Ω ω m(ω) dω,

which is equivalent to E_M(ω) = E_N(fω). This result is applied in importance sampling to approximate intractable integrals. Suppose we want to compute the expected value of a function g(ω), where the expectation is taken with respect to a probability density f(ω). We can express the expectation of g(ω) as

E_f{g(ω)} = ∫ g(ω) f(ω) dω = ∫ [g(ω) f(ω) / h(ω)] h(ω) dω  (11)
for any probability density h(ω) defined on the same sample space that is zero only if f(ω) is zero. Now, if we can generate a random sample ω̃_1, ω̃_2, ..., ω̃_n from the distribution with density h(ω), we can approximate E_f{g(ω)}, using (11), as

E_f{g(ω)} ≈ (1/n) Σ_{i=1}^{n} g(ω̃_i) f(ω̃_i) / h(ω̃_i).  (12)

The standard error corresponding to the above estimate is given by dividing the standard deviation of the importance ratios g(ω̃_i) f(ω̃_i)/h(ω̃_i) by √n. This standard deviation determines the quality of the approximation. The density h(ω) is called the importance sampling density. If h(ω) is chosen so that the importance ratio g(ω)f(ω)/h(ω) is roughly constant over the range of possible values of ω (which will make the standard deviation of the importance ratios small), a fairly accurate approximation of the integral may be obtained. In particular, h(ω) should have heavier tails than the product g(ω)f(ω), since otherwise there may be a few ω̃_i's for which h(ω̃_i) is much smaller than g(ω̃_i)f(ω̃_i), and the estimate will blow up. The Radon-Nikodym theorem requires N(A) = 0 ⇒ M(A) = 0 to hold, which can be stated in an approximate fashion as: wherever h is zero, gf needs to be zero. This is the theoretical justification for the heavier tail requirement of h over gf. The main advantages of importance sampling are that it provides an unbiased estimate of the integral and that the accuracy can be monitored by examining the standard deviation of the g(ω̃_i)f(ω̃_i)/h(ω̃_i) values. In some situations, the researcher knows a distribution only up to a constant, but still needs to compute the moments of the distribution. For example, often all that is available in a Bayesian analysis is a multiple of the posterior (obtained by multiplying the likelihood and the prior), but not the posterior itself. Importance sampling can be used in this situation as well. Suppose, in the above setup, one does not know f(ω), but does know q(ω) ≡ c f(ω) for some unknown constant c.
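As a small numerical illustration of (12) and its standard error, the following sketch estimates E_f{g(ω)} for toy choices f = N(0, 1) and g(ω) = ω², so the true value is 1. The Student-t importance density with 4 degrees of freedom is an assumption made here for the example, chosen because it has heavier tails than f.

```python
import numpy as np

rng = np.random.default_rng(42)

def norm_pdf(w):
    # standard normal density, the target f
    return np.exp(-0.5 * w**2) / np.sqrt(2 * np.pi)

def t4_pdf(w):
    # Student-t density with 4 degrees of freedom, the importance density h
    from math import gamma
    nu = 4.0
    c = gamma((nu + 1) / 2) / (np.sqrt(nu * np.pi) * gamma(nu / 2))
    return c * (1 + w**2 / nu) ** (-(nu + 1) / 2)

n = 100_000
w = rng.standard_t(df=4, size=n)             # draws from h
ratios = (w**2) * norm_pdf(w) / t4_pdf(w)    # g(w) f(w) / h(w)
estimate = ratios.mean()                     # eq. (12); close to the true value 1
std_err = ratios.std(ddof=1) / np.sqrt(n)    # SD of the ratios over sqrt(n)
```

Monitoring std_err, the standard deviation of the importance ratios divided by √n, indicates how accurate the approximation is.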
The expectation of a function g(ω) is then given by

E_f{g(ω)} = ∫_ω g(ω) q(ω) dω / ∫_ω q(ω) dω.
The numerator on the right-hand side above is approximated by

∫ g(ω) q(ω) dω = ∫ [g(ω) q(ω) / h(ω)] h(ω) dω ≈ (1/n) Σ_{i=1}^{n} g(ω̃_i) q(ω̃_i) / h(ω̃_i),

while the denominator is approximated by

(1/n) Σ_{i=1}^{n} q(ω̃_i) / h(ω̃_i)

(see, e.g., Gelman, Carlin, Stern, & Rubin, 2003). Geweke (1989) shows that the resulting estimate converges almost surely to the estimand under weak assumptions, and that a central limit theorem (establishing that n^{1/2}(Ê_f{g(ω)} − E_f{g(ω)}) → N(0, τ²), where τ² can be estimated consistently) applies under stronger assumptions. The estimation of the ratio of two quantities by the ratio of estimates of the quantities is known as ratio estimation. For more details about ratio estimation, see, for example, Cochran (1977).

3.2 Importance Sampling in the E-step of the MGROUP Program

This paper suggests approximating the posterior expectation and variance of the examinee proficiencies θ_i in (6) and (7) using the importance sampling method. The posterior distribution of θ_i, denoted as p(θ_i | X, Y, Γ_t, Σ_t), is given by

p(θ_i | X, Y, Γ_t, Σ_t) ∝ f_1(y_i1 | θ_i1) ⋯ f_1(y_ip | θ_ip) φ(θ_i | Γ'_t x_i, Σ_t),  (13)

using (2). The proportionality constant in (13) is a function of y_i, Γ_t, and Σ_t. Let us denote

q(θ_i | X, Y, Γ_t, Σ_t) ≡ f_1(y_i1 | θ_i1) ⋯ f_1(y_ip | θ_ip) φ(θ_i | Γ'_t x_i, Σ_t).  (14)

We drop the subscript i for convenience for the rest of the section and let θ denote the proficiency of an examinee. We have to compute the mean and variance of p(θ | X, Y, Γ_t, Σ_t), that is, the quantities

E(θ | X, Y, Γ_t, Σ_t) ≡ ∫ θ p(θ | X, Y, Γ_t, Σ_t) dθ, and  (15)

Var(θ | X, Y, Γ_t, Σ_t) ≡ ∫ (θ − E(θ | X, Y, Γ_t, Σ_t))(θ − E(θ | X, Y, Γ_t, Σ_t))' p(θ | X, Y, Γ_t, Σ_t) dθ.  (16)
Using ideas from the above description of the importance sampling method, if we can generate a random sample θ̃_1, θ̃_2, ..., θ̃_n from a distribution h(θ) approximating p(θ | X, Y, Γ_t, Σ_t) reasonably well, we can approximate (15) by the ratio of

(1/n) Σ_{j=1}^{n} θ̃_j q(θ̃_j | X, Y, Γ_t, Σ_t) / h(θ̃_j)

and

(1/n) Σ_{j=1}^{n} q(θ̃_j | X, Y, Γ_t, Σ_t) / h(θ̃_j).

Similarly, we can approximate (16) as the ratio of

(1/n) Σ_{j=1}^{n} (θ̃_j − Ê(θ | X, Y, Γ_t, Σ_t))(θ̃_j − Ê(θ | X, Y, Γ_t, Σ_t))' q(θ̃_j | X, Y, Γ_t, Σ_t) / h(θ̃_j)

and

(1/n) Σ_{j=1}^{n} q(θ̃_j | X, Y, Γ_t, Σ_t) / h(θ̃_j),

where Ê(θ | X, Y, Γ_t, Σ_t) denotes the importance sampling estimate of the posterior mean. Following Booth and Hobert (1999), who apply the importance sampling EM method for fitting generalized linear mixed models, we use a multivariate t importance sampling density. Booth and Hobert (1999) argue that, in general, a multivariate t density is a better choice as an importance sampling density than a normal density. One reason is that the former is heavy-tailed while the latter is not, and a good importance sampling density should be heavy-tailed when the density it approximates may be so; also, the t density is easy to sample from. The degrees of freedom of the t density are taken to be four because low degrees of freedom ensure that the t density has heavy tails. (The choice of optimal degrees of freedom is an issue requiring further investigation.) The mean and variance of the t importance sampling density are taken as input from a nonstochastic version of MGROUP. In our implementation, the YGROUP version (von Davier & Yu, 2003) is used, which conveniently provides input for the importance sampling density in the stochastic EM algorithm. It is also possible to estimate variances of the parameter estimates obtained from the EM algorithm as suggested by, for example, Booth and Hobert (1999). However, we are
more interested in the estimation of the latent regression parameters in this work and do not delve into that issue here. Note that a number of studies have successfully applied other forms of stochastic EM methods, such as rejection sampling EM (Booth & Hobert, 1999) and Markov chain Monte Carlo EM, or MCMC-EM (e.g., McCulloch, 1994), but we do not explore those approaches, mainly because of the difficulties in their application to our problem.

4 MCEM in MGROUP

This section describes the implementation of MCEM (short for Markov chain Monte Carlo EM), the version of MGROUP that uses importance sampling to generate the necessary statistics in the E-step. Because of the presence of a stochastic component in the E-step, this technique is also called a stochastic EM algorithm. Following the notes on the implementation, differences between assessing convergence in nonstochastic and in stochastic optimization algorithms will be discussed.

4.1 Implementing MCEM

The MCEM algorithm was integrated into YGROUP (von Davier, 2003), which is based on BGROUP and implements SUR (Zellner, 1962) and various other modifications. Following Booth and Hobert (1999), importance sampling is used in the E-step to generate the statistics (here, posterior means, residual components, and measurement error components) necessary to perform the EM iterations for the latent regression. The implementation uses importance sampling employing a multivariate t density (as in Booth & Hobert, 1999). The MCEM version as implemented in YGROUP uses the SUR posterior means and the posterior residual (co-)variance matrix as fast and efficient means to compute the mean and variance of the importance sampling density. As in all versions of MGROUP, optional starting values for the latent regression (Γ, Σ) can be provided but are not essential. As an alternative, initial iterations with a nonstochastic E-step of one of the other versions can be carried out to generate starting values.
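The E-step just described, drawing from a multivariate t with 4 degrees of freedom whose location and scale come from a nonstochastic (SUR-style) pass and forming self-normalized importance-sampling estimates of the posterior mean and variance, might look like the following sketch. The function and argument names are illustrative assumptions, not the MGROUP code.

```python
import numpy as np

def e_step_moments(log_q, mu, cov, df=4, size=6000, rng=None):
    """Approximate one examinee's posterior mean and variance by
    self-normalized importance sampling, as in Section 3.2.

    log_q : function mapping a (size, p) array of theta draws to the log
            of the unnormalized posterior q(theta | X, Y, Gamma_t, Sigma_t)
    mu,cov: location and scale of the multivariate-t importance density,
            e.g. taken from a nonstochastic (SUR/YGROUP-style) pass
    """
    rng = np.random.default_rng() if rng is None else rng
    p = len(mu)
    # draw from a multivariate t_df(mu, cov), built as a normal scale mixture
    z = rng.multivariate_normal(np.zeros(p), cov, size=size)
    u = rng.chisquare(df, size=size)
    theta = mu + z * np.sqrt(df / u)[:, None]
    # log density of the multivariate t at the draws
    from numpy.linalg import slogdet, solve
    from math import lgamma
    d = theta - mu
    maha = np.einsum('ij,ij->i', d, solve(cov, d.T).T)
    _, logdet = slogdet(cov)
    log_h = (lgamma((df + p) / 2) - lgamma(df / 2)
             - 0.5 * (p * np.log(df * np.pi) + logdet)
             - 0.5 * (df + p) * np.log1p(maha / df))
    # self-normalized importance weights proportional to q / h
    log_w = log_q(theta) - log_h
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    mean = w @ theta
    centered = theta - mean
    var = (w[:, None] * centered).T @ centered
    return mean, var
```

For any target whose unnormalized log density is available, the routine returns moment estimates whose Monte Carlo error shrinks as the importance sample grows; the unknown normalizing constant of q cancels in the self-normalized weights, exactly as in the ratio estimation described above.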
The initial sample size of the importance sample can be chosen to be much smaller
than the final sample size, in order to speed up initial iterations, where individual accuracy is less important than approaching a good starting value for the aggregate parameters. The examples below use 500 as the initial importance sample size and 6,000 as the final importance sample size.

4.2 Determining Convergence of the MCEM Algorithm

When applying a stochastic EM algorithm, it is possible that the EM algorithm has not converged but appears to have done so after an EM step (i.e., updated parameters and likelihoods do not change much from the previous step) because of an unlucky random sample. One should therefore use a stricter convergence criterion, or a larger variety of criteria that all have to be met simultaneously, in order to ensure the convergence of the algorithm. We monitor the likelihood function and the parameter vector for convergence and conclude that convergence has occurred only when the relative changes in both of these quantities are less than ε in five successive iterations. The likelihood increases over the iterations in a nonstochastic EM algorithm (see, e.g., Dempster et al., 1977), while this may not be so for a stochastic EM algorithm because of Monte Carlo error. In order to assess convergence of stochastic optimization algorithms, the inevitable nonmonotonicity of these algorithms has to be taken into account. Each E-step in our implementation employs a different set of points (the importance sample) in the multivariate quadrature grid in order to perform the integration. This means that the likelihood of each observed response vector may vary, and hence the likelihood of the whole sample may vary, even if all other parameters are unchanged, from one importance sample to the next. Therefore, in our implementation, we monitor absolute changes in the log likelihood, the regression parameters Γ, and the residual (co-)variance matrix Σ simultaneously.
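A sketch of this three-way monitoring, with illustrative function names and user-defined bounds, might look as follows:

```python
import numpy as np

def max_abs_changes(loglik, loglik_prev, Gamma, Gamma_prev, Sigma, Sigma_prev):
    """Absolute change in the log likelihood and maximum absolute changes
    in the regression parameters Gamma and the residual (co-)variance
    matrix Sigma between two successive EM cycles."""
    return (abs(loglik - loglik_prev),
            float(np.max(np.abs(Gamma - Gamma_prev))),
            float(np.max(np.abs(Sigma - Sigma_prev))))

def stop_now(changes, bounds):
    # stop if, and only if, all three changes are below their bounds
    return all(c < b for c, b in zip(changes, bounds))
```

Because a single lucky draw can make all three changes small at once, a rule of this form is typically combined with a memory of previous changes, as described next.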
The maximum absolute changes are checked against user-defined stopping criteria, which are used to decide whether a significant change has occurred that demands further iterations towards the optimum. The algorithm stops if, and only if, all three absolute changes are smaller than the user-defined bounds. In addition, the absolute changes of previous iterations are combined with the current change by the following approach: For cycle t, define the average absolute maximum change (AAMC) as

AAMC_{x,t} = p · AMC_{x,t} + (1 − p) · AAMC_{x,t−1},

which averages the current absolute maximum change (AMC) and the previous average absolute maximum change (AAMC). Here, x denotes a parameter (vector) or a function of parameters (for example, the maximum change in the regression parameters in our case); t denotes the current iteration (cycle) of the algorithm and t − 1 the previous cycle; and 0 ≤ (1 − p) ≤ 1.0 is the weight given to the averaged previous cycles in the criterion. This criterion ensures that, if p < 1, more than the current change (which might be small due to the stochastic nature of the algorithm) is taken into account when deciding whether to stop iterating. If p = 1, we have a regular (no-memory) stopping criterion, which stops whenever the current iteration alone meets the stopping rule. In the current implementation, the stopping rule fades out the past absolute changes rather slowly. We found that p = 0.4, together with the chosen AAMC bound on the relative changes in the likelihood, the regression weights, and the variances, works reasonably well; this choice balances endless iteration against premature termination of the algorithm in the examples presented here. The rule for stopping the algorithm is that AAMC_{x,t} must fall below this bound for termination, a maximum-change bound similar to what Booth and Hobert (1999) report.

5 Analysis of Simulated Data

This section presents the first proof-of-concept example. The simulated data set used for this example matches the structure and size of the 2000 NAEP mathematics assessment at grade 4 and comes from a previous study (von Davier & Yu, 2003). The assessment has five scales: (a) Number and Operations, (b) Measurement, (c) Geometry, (d) Data Analysis, and (e) Algebra.
This section applies the stochastic EM method to find the parameter estimates for the simulated data set, which involves responses of 13,511 students. Each student responds to 1 of 26 sets of items (booklets). Each booklet contains three blocks of items (both multiple-choice and constructed-response) out of a total of 145 items. Each examinee has information on 381 predictors that are used in the latent regression model. This section compares the MCEM-GROUP and the operational CGROUP estimates of the posterior means and standard deviations of examinees, the regression parameters Γ, and the residual variance-covariance matrix Σ. As an additional benchmark, the results of both CGROUP and MCEM-GROUP are compared against the generating values, that is, the individual latent trait vectors used to simulate the data.

5.1 Estimates of Γ and Σ for the Simulated Data

Figure 1 shows a comparison of the CGROUP and MCEM-GROUP regression parameter estimates (i.e., estimates of the individual parameters in Γ) for the 2000 NAEP mathematics assessment at grade 4 by subscale. The figure shows that the estimates from the two approaches are extremely close. Table 1 shows the residual variance-covariance estimates (i.e., estimates of the individual parameters in Σ) from CGROUP and MCEM-GROUP for the simulated data set. The table shows slightly larger estimated variances for the MCEM version of MGROUP as compared to CGROUP. The pattern is the opposite for the estimated covariances. The differences between the covariance estimates of CGROUP and MCEM-GROUP seem smaller than the differences between the variance estimates of the two approaches.

5.2 Subgroup Estimates for the 2000 NAEP Mathematics Assessment at Grade 4

Figures 2 to 4 present recovery plots for CGROUP and MCEM-GROUP. The figures also show subplots comparing the individual means and standard deviations from CGROUP and MCEM-GROUP. Each subscale has four subplots. The recovery plots are based on individual examinee statistics and hence will show more variance than is present in the population parameters Γ and Σ. In the recovery plots, the generating values are the reference, and the conditional posterior moments (given the responses, the background information, and the maximum likelihood estimates of Γ and Σ, not the random draws
of these parameters used when generating imputations with MGROUP) as generated by CGROUP and MCEM-GROUP are plotted against these reference values. For all subscale displays, the ideal recovery or agreement line, respectively, is indicated by the main diagonal given in each plot.

Figure 1. Regression parameter (Γ) estimates from CGROUP and MCEM-GROUP for the 2000 NAEP mathematics assessment at grade 4, by subscale (Number & Operations, Measurement, Geometry, Data Analysis, Algebra).

Figures 2 to 4 show no obvious differences between CGROUP and MCEM-GROUP when comparing the plots showing recovery of the generating θ values. This observation finds additional support when examining the plot of the two sets of posterior mean estimates against each other. The posterior means generated by CGROUP and MCEM-GROUP fall nicely around the diagonal line, and the plot is much narrower than the truth-recovery plots. This indicates that CGROUP and MCEM-GROUP agree quite well in producing posterior means, a statement that cannot be made for the posterior standard deviations.
Figure 2. Posterior moments from CGROUP and MCEM-GROUP compared against the generating values, and the posterior means and standard deviations from CGROUP and MCEM-GROUP compared against each other (Number & Operations and Measurement subscales).
Figure 3. Posterior moments from CGROUP and MCEM-GROUP compared against the generating values, and the posterior means and standard deviations from CGROUP and MCEM-GROUP compared against each other (Geometry and Data Analysis subscales).
Table 1. Residual Variance-Covariance Estimates From CGROUP and MCEM-GROUP
(Scales: NUM&OPER, MEASURMT, GEOMETRY, DATA ANL, ALGEBRA; one 5 × 5 matrix of CGROUP estimates and one of MCEM-GROUP estimates.)
Note. Based on the operational population model for the 2000 NAEP mathematics assessment at grade 4.

The posterior standard deviations given by MCEM-GROUP are considerably larger than those given by CGROUP for all subscales. It remains to be determined whether this is a systematic difference, whether it is due to Monte Carlo inflation of the variance, or whether it depends on some other structural variable such as the correlation between scales or the sample size. Table 2 compares the subgroup means from CGROUP and MCEM-GROUP for the relevant subgroups; there seems to be little difference between the two methods in this respect as well. The Table 2 results show that MCEM-GROUP does not perform much differently from CGROUP in reproducing group-level statistics such as subgroup means and variances.
Figure 4. Posterior moments from CGROUP and MCEM-GROUP compared against the generating values, and the posterior means and standard deviations from CGROUP and MCEM-GROUP compared against each other (Algebra subscale).

6 Analysis of Real NAEP Data: 2003 Reading Assessment at Grade 4

The 2003 NAEP reading assessment at grade 4 has data from 187,581 examinees who are in the fourth grade or 9 years of age. Each student answers a literary block (to assess the ability to read for literary experience) and an informative block (to assess the ability to read for information). Each block consists of 10 to 12 items, with a mixture of multiple-choice and constructed-response items. We will refer to the two skills (subscales) measured by the assessment as literary and information. The combined item sample for these two scales has 111 items in total; 688 background variables are used in the latent regression model. Because the θ_i's are two-dimensional, it is possible to apply BGROUP (which performs exact numerical integration in the E-step) here. BGROUP provides the gold standard, and we will compare the results obtained from the application of the stochastic EM method against it. Comparison with results from CGROUP will be provided as well.
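For a two-dimensional θ, the exact E-step integration that BGROUP performs can be illustrated with a tensor-product quadrature grid. The sketch below is an assumption-laden toy, not the actual BGROUP implementation: the normal "likelihood", the function names, and the N(μ, σ²) prior standing in for the latent regression contribution are all illustrative.

```python
import numpy as np

def posterior_mean_quadrature(loglik_fn, mu, sigma, n_nodes=31):
    """Posterior mean of a 2-D latent trait by direct numerical quadrature.
    The prior contribution is N(mu, diag(sigma**2)); loglik_fn returns the
    log-likelihood of the observed responses at each grid point."""
    # Gauss-Hermite nodes/weights for the probabilists' weight exp(-x^2/2),
    # rescaled so the grid covers the prior N(mu, sigma^2).
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    t1 = mu[0] + sigma[0] * nodes
    t2 = mu[1] + sigma[1] * nodes
    g1, g2 = np.meshgrid(t1, t2, indexing="ij")
    grid = np.stack([g1.ravel(), g2.ravel()], axis=1)
    w = np.outer(weights, weights).ravel()
    # Posterior weights = prior quadrature weights times the likelihood.
    post = w * np.exp(loglik_fn(grid))
    post /= post.sum()
    return post @ grid

# Toy normal "likelihood" centered at (1, 0): with an N(0, I) prior, the
# posterior mean is pulled halfway, to (0.5, 0).
loglik = lambda th: -0.5 * np.sum((th - np.array([1.0, 0.0])) ** 2, axis=1)
mean = posterior_mean_quadrature(loglik, mu=np.array([0.0, 0.0]),
                                 sigma=np.array([1.0, 1.0]))
```

The cost of such a grid grows as (number of nodes)^p, which is why exact integration is practical only for p ≤ 2 and stochastic integration becomes attractive for the five-scale mathematics assessment.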
Table 2. Comparison of Subgroup Estimates From CGROUP and MCEM-GROUP
(Columns: NumOp, Meas, Geom, DA, Alg for each of CGROUP and MCEM-GROUP; rows: Overall, Male, Female, White, Black, Hispanic, Asian, Amerind.)

6.1 Estimates of Γ and Σ for the Real Data

Table 3 shows the residual variance estimates Σ̂ as generated by BGROUP, CGROUP, and MCEM-GROUP for the NAEP data.

Table 3. Residual Variances, Covariances, and Correlations for the 2003 NAEP Reading Assessment at Grade 4 Data
(Columns: Lit. and Info. for each of BGROUP, CGROUP, and MCEM-GROUP; rows: Lit., Info.)
Note. Lit. = Literary, Info. = Information. Residual variances = main diagonals, covariances = upper off-diagonals, and correlations = lower off-diagonals.

The three sets of estimates of the residual correlation and the residual variances are close. Interestingly, the residual variance estimates generated by MCEM-GROUP are slightly smaller than the corresponding
CGROUP estimates for the reading data, whereas the residual variance estimates for the simulated math data were larger for MCEM-GROUP than for CGROUP. As a consequence, the residual correlation is higher for MCEM-GROUP than for CGROUP. Whether this is an effect of the sample size or of the number of scales needs further investigation.

Figure 5 shows the regression parameter estimates as generated by CGROUP and MCEM-GROUP plotted against the estimates generated by the gold standard for this data example, BGROUP.

Figure 5. Regression coefficients for the 2003 NAEP reading assessment at grade 4 (CGROUP and MCEM-GROUP estimates plotted against BGROUP, for the literary and information scales).

The figure shows that the 688 regression coefficient estimates each for the literary and the information scales agree to a large extent between the different versions of MGROUP. Almost all points fall on the main diagonal, which indicates that the Γ̂ estimates produced by BGROUP, CGROUP, and MCEM-GROUP are virtually identical.

6.2 Subgroup Estimates for the 2003 NAEP Reading Assessment at Grade 4

Figure 6 presents plots of the posterior means as generated by CGROUP and MCEM-GROUP compared to the posterior means generated by BGROUP.

Figure 6. Posterior means as generated by CGROUP and MCEM-GROUP and plotted against the reference, BGROUP, for the NAEP 2003 grade 4 reading data (literary and information scales).

Figure 7 compares the corresponding posterior standard deviations. In contrast to the first example (the one involving simulated data), there is no obvious large difference between the estimates of the posterior means or standard deviations (SDs) generated by CGROUP and MCEM-GROUP. Both approaches differ slightly from the estimates generated by BGROUP. CGROUP overestimates the posterior standard deviation for larger values when compared to BGROUP; the graph comparing BGROUP and CGROUP posterior SDs shows this effect in a way very similar to what Thomas (1993) reports in his Figure 1. MCEM-GROUP, on the other hand, stays closer to BGROUP for large posterior standard deviations than CGROUP does. For small standard deviations, MCEM-GROUP seems to produce slightly larger values than BGROUP, while the estimates generated by CGROUP and BGROUP agree better for smaller standard deviations. Except for a few outliers, the posterior means as generated by MCEM-GROUP agree nicely with the values produced by both BGROUP and CGROUP. There is no obvious indication of systematic over- or underestimation for the vast majority of subjects.

Figure 7. Posterior standard deviations as generated by CGROUP and MCEM-GROUP and plotted against the reference, BGROUP, for the 2003 NAEP reading assessment at grade 4 data (literary and information scales).

The obvious outliers seen in these plots are the subject of ongoing research aimed at optimizing and stabilizing the current implementation of MCEM-GROUP. It will be investigated whether the stochastic integration failed to produce meaningful results because of poor coverage of the true posterior by the importance sampling distribution, because of the small size of the importance sample, or for other, less obvious reasons. Table 4 shows the subgroup estimates and the corresponding standard deviations (in parentheses) provided by the three methods. There is hardly any difference between CGROUP and MCEM-GROUP as far as subgroup estimates are concerned. However, when compared to the BGROUP results, both CGROUP and MCEM-GROUP slightly underestimate the subgroup means.
Table 4. Comparison of Subgroup Estimates From BGROUP, CGROUP, and MCEM-GROUP

Pop. sub.   BGROUP Lit.  BGROUP Info.  CGROUP Lit.  CGROUP Info.  MCEM Lit.  MCEM Info.
Overall     (0.95)       (0.98)        (0.96)       (0.99)        (0.94)     (0.98)
Male        (0.96)       (1.00)        (0.96)       (1.01)        (0.94)     (0.99)
Female      (0.94)       (0.97)        (0.94)       (0.98)        (0.93)     (0.96)
White       (0.87)       (0.89)        (0.87)       (0.90)        (0.86)     (0.89)
Black       (0.92)       (0.92)        (0.92)       (0.93)        (0.91)     (0.91)
Hisp.       (0.94)       (0.94)        (0.94)       (0.95)        (0.93)     (0.93)
Asian       (0.93)       (0.97)        (0.94)       (0.98)        (0.92)     (0.96)
AmInd       (0.94)       (0.94)        (0.95)       (0.95)        (0.92)     (0.93)
Public      (0.96)       (0.99)        (0.96)       (1.00)        (0.94)     (0.98)
Private     (0.83)       (0.87)        (0.83)       (0.88)        (0.82)     (0.86)

Note. Pop. sub. = Population subgroup, Lit. = Literary, Info. = Information, Hisp. = Hispanic, Asian = Asian Pacific, AmInd = American Indian. Standard deviations are shown in parentheses.

7 Conclusions and Future Work

CGROUP is the current operational method used in large-scale assessments such as NAEP. Though CGROUP provides more accurate results than its predecessor (NGROUP), it is not without problems, as demonstrated by Thomas (1993); in particular, CGROUP is found to inflate variance estimates for examinees with large posterior variances. As of now, there is no entirely satisfactory alternative to CGROUP. As this work shows, a stochastic EM method using importance sampling provides a viable alternative. Application of the importance sampling EM method to a simulated data set and a real data set reveals interesting facts. The regression parameter estimates are extremely close for BGROUP (the gold standard), CGROUP (the current operational version of MGROUP used in NAEP, etc.), and the newly proposed MCEM-GROUP. The estimates of conditional posterior means seem
to be quite close for CGROUP and MCEM-GROUP. In the real data example, both of these sets of estimates are very close to the BGROUP estimates. The MCEM algorithm produced a few outlying estimates of the conditional posterior means, especially in the real data example, which may be due to Monte Carlo error; this issue needs further investigation. Clear statements about the agreement between posterior standard deviations cannot be made as of now. For the real data example, which has a large sample size, CGROUP provides inflated estimates of the posterior SDs for examinees with large SDs, while MCEM-GROUP does not. However, MCEM-GROUP results in a few outliers in the middle of the range of the SDs. Interestingly, the pattern is quite different for the simulated data, where the MCEM-GROUP estimates of the posterior SDs are bigger than those from CGROUP. Overall, the stochastic EM method performs slightly better than the CGROUP version of MGROUP.

One problem with the stochastic EM method is that it is very time-consuming. For example, for our real data example (the 2003 reading data), BGROUP takes 8 hours and CGROUP takes 4 hours on a Pentium IV computer with 2.2 GHz, whereas MCEM-GROUP takes about 48 hours, longer than CGROUP by a factor of 12. However, with the advent of increasingly fast computers, application of MCEM-GROUP will become more feasible; at the least, the method can be used to obtain a second opinion right now.

It is possible to improve the efficiency of the importance sampling EM method even further. Consider the estimation of E(θ|X, Y, Γ_t, Σ_t), which is given in (15). Let θ_0 denote the mode of p(θ|X, Y, Γ_t, Σ_t). We have

E(θ|X, Y, Γ_t, Σ_t) = ∫ θ p(θ|X, Y, Γ_t, Σ_t) dθ
  ∝ ∫ θ exp{u(θ)} dθ,  for exp{u(θ)} ∝ p(θ|X, Y, Γ_t, Σ_t)
  = ∫ θ exp{u(θ_0) + ½(θ − θ_0)′u″(θ_0)(θ − θ_0) + Δ(θ)} dθ,  as u′(θ_0) = 0
  = e^{u(θ_0)} ∫ θ exp{½(θ − θ_0)′u″(θ_0)(θ − θ_0)} exp(Δ(θ)) dθ,

where Δ(θ) = u(θ) − u(θ_0) − ½(θ − θ_0)′u″(θ_0)(θ − θ_0). The first term under the exponential in the last step above is the kernel of a normal distribution, and Δ(θ) is much less variable (and close to zero) over the range of θ than u(θ) is.
Therefore, using the above normal distribution as the importance sampling density
More informationA note on multiple imputation for general purpose estimation
A note on multiple imputation for general purpose estimation Shu Yang Jae Kwang Kim SSC meeting June 16, 2015 Shu Yang, Jae Kwang Kim Multiple Imputation June 16, 2015 1 / 32 Introduction Basic Setup Assume
More informationBAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA
BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci
More information7. Estimation and hypothesis testing. Objective. Recommended reading
7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing
More informationLecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis
Lecture 3 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationSTA 216, GLM, Lecture 16. October 29, 2007
STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural
More informationUnivariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation
Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation PRE 905: Multivariate Analysis Spring 2014 Lecture 4 Today s Class The building blocks: The basics of mathematical
More informationCenter for Advanced Studies in Measurement and Assessment. CASMA Research Report. Hierarchical Cognitive Diagnostic Analysis: Simulation Study
Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 38 Hierarchical Cognitive Diagnostic Analysis: Simulation Study Yu-Lan Su, Won-Chan Lee, & Kyong Mi Choi Dec 2013
More information(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis
Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals
More informationLikelihood-based inference with missing data under missing-at-random
Likelihood-based inference with missing data under missing-at-random Jae-kwang Kim Joint work with Shu Yang Department of Statistics, Iowa State University May 4, 014 Outline 1. Introduction. Parametric
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationRecent Advances in the analysis of missing data with non-ignorable missingness
Recent Advances in the analysis of missing data with non-ignorable missingness Jae-Kwang Kim Department of Statistics, Iowa State University July 4th, 2014 1 Introduction 2 Full likelihood-based ML estimation
More informationThe Metropolis-Hastings Algorithm. June 8, 2012
The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings
More informationBasic IRT Concepts, Models, and Assumptions
Basic IRT Concepts, Models, and Assumptions Lecture #2 ICPSR Item Response Theory Workshop Lecture #2: 1of 64 Lecture #2 Overview Background of IRT and how it differs from CFA Creating a scale An introduction
More informationBayesian Analysis of Latent Variable Models using Mplus
Bayesian Analysis of Latent Variable Models using Mplus Tihomir Asparouhov and Bengt Muthén Version 2 June 29, 2010 1 1 Introduction In this paper we describe some of the modeling possibilities that are
More informationCOS513 LECTURE 8 STATISTICAL CONCEPTS
COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions
More informationAn Introduction to Path Analysis
An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving
More informationEquating of Subscores and Weighted Averages Under the NEAT Design
Research Report ETS RR 11-01 Equating of Subscores and Weighted Averages Under the NEAT Design Sandip Sinharay Shelby Haberman January 2011 Equating of Subscores and Weighted Averages Under the NEAT Design
More informationVariational Bayesian Inference for Parametric and Non-Parametric Regression with Missing Predictor Data
for Parametric and Non-Parametric Regression with Missing Predictor Data August 23, 2010 Introduction Bayesian inference For parametric regression: long history (e.g. Box and Tiao, 1973; Gelman, Carlin,
More informationBayesian Mixture Modeling
University of California, Merced July 21, 2014 Mplus Users Meeting, Utrecht Organization of the Talk Organization s modeling estimation framework Motivating examples duce the basic LCA model Illustrated
More informationBayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference
1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE
More informationDAG models and Markov Chain Monte Carlo methods a short overview
DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex
More informationOn Markov chain Monte Carlo methods for tall data
On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational
More informationStat 451 Lecture Notes Monte Carlo Integration
Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:
More informationLINKING IN DEVELOPMENTAL SCALES. Michelle M. Langer. Chapel Hill 2006
LINKING IN DEVELOPMENTAL SCALES Michelle M. Langer A thesis submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Master
More informationSTAT 425: Introduction to Bayesian Analysis
STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte
More informationPart 1: Expectation Propagation
Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud
More informationMH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution
MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous
More informationStatistical Distribution Assumptions of General Linear Models
Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions
More informationStatistical and psychometric methods for measurement: G Theory, DIF, & Linking
Statistical and psychometric methods for measurement: G Theory, DIF, & Linking Andrew Ho, Harvard Graduate School of Education The World Bank, Psychometrics Mini Course 2 Washington, DC. June 27, 2018
More informationFractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling
Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Jae-Kwang Kim 1 Iowa State University June 26, 2013 1 Joint work with Shu Yang Introduction 1 Introduction
More informationMCMC algorithms for fitting Bayesian models
MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models
More information2 Bayesian Hierarchical Response Modeling
2 Bayesian Hierarchical Response Modeling In the first chapter, an introduction to Bayesian item response modeling was given. The Bayesian methodology requires careful specification of priors since item
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationBiostat 2065 Analysis of Incomplete Data
Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies
More informationInvestigation into the use of confidence indicators with calibration
WORKSHOP ON FRONTIERS IN BENCHMARKING TECHNIQUES AND THEIR APPLICATION TO OFFICIAL STATISTICS 7 8 APRIL 2005 Investigation into the use of confidence indicators with calibration Gerard Keogh and Dave Jennings
More information