Application of the Stochastic EM Method to Latent Regression Models
Research Report RR-04-34

Application of the Stochastic EM Method to Latent Regression Models

Matthias von Davier and Sandip Sinharay

ETS, Princeton, NJ

Research & Development, August 2004

ETS Research Reports provide preliminary and limited dissemination of ETS research prior to publication. To obtain a PDF or a print copy of a report, please visit the ETS Web site.
Abstract

The reporting methods used in large scale assessments such as the National Assessment of Educational Progress (NAEP) rely on a latent regression model. The first component of the model consists of a p-scale IRT measurement model that defines the response probabilities on a set of cognitive items in p scales depending on a p-dimensional latent trait variable θ = (θ_1, ..., θ_p). In the second component, the conditional distribution of this latent trait variable θ is modeled by a multivariate, multiple regression on a set of predictor variables, which are usually based on student, school, and teacher variables in assessments such as NAEP. In order to fit the latent regression model using the maximum (marginal) likelihood estimation technique, multivariate integrals have to be evaluated. In the computer program MGROUP used by ETS for fitting the latent regression model to data from NAEP and other sources, the integration is currently done either by numerical quadrature (for problems up to two dimensions) or by an approximation of the integral. CGROUP, the current operational version of the MGROUP program used in NAEP and other assessments since 1993, is based on a Laplace approximation that may not provide fully satisfactory results, especially if the number of items per scale is small. This paper examines the application of stochastic expectation-maximization (EM) methods (where an integral is approximated by an average over a random sample) to NAEP-like settings. We present a comparison of CGROUP with a promising implementation of the stochastic EM algorithm that utilizes importance sampling. Simulation studies and real data analysis show that the stochastic EM method provides a viable alternative to CGROUP for fitting multivariate latent regression models.

Key words: CGROUP, NAEP, importance sampling, stochastic integration
Acknowledgements

The authors thank John Mazzeo, Shelby Haberman, Alina von Davier, Neal Thomas, Andreas Oranje, Ying Jin, and Matthew Johnson for useful advice, Steve Isham for help with the data sets used in the analysis, and Kim Fryer for help with proofreading.
1 Introduction

The National Assessment of Educational Progress (NAEP), the only regularly administered and congressionally mandated national assessment program (see, e.g., Beaton & Zwick, 1992), is an ongoing survey of the academic achievement of school students in the United States in a number of subject areas such as reading, writing, and mathematics. It is administered by the National Center for Education Statistics (NCES), a part of the U.S. Department of Education, to a selected sample of students in Grades 4, 8, and 12. A document called the Nation's Report Card reports the results of NAEP for a number of academic subjects on a regular basis, with comparisons to the previous assessments in each subject area. Academic achievement is described as average student proficiency for all students in the U.S. and as the percentage of students attaining fixed levels of proficiency in different subjects (the levels being defined by the National Assessment Governing Board as what students should know at each grade level). In addition to producing these numbers for the nation as a whole, NAEP reports the same results for different subpopulations (based on gender, race, school type, etc.) of the student population. NAEP is prohibited by law from reporting results for individual students, schools, or school districts and is designed to obtain optimal estimates of subpopulation characteristics rather than those of individual performance. To assess national performance in a valid way, NAEP must sample a wide and diverse body of student knowledge. To avoid the burden involved with presenting each item to every examinee, NAEP selects students randomly from designated grade and age populations (first, a sample of schools is selected according to a detailed stratified sampling plan, as mentioned in, e.g., Beaton & Zwick, 1992, and then students are sampled within the schools). NAEP then administers one of many possible booklets of items to each student.
This process is sometimes referred to as matrix sampling of items. For example, the 2000 NAEP mathematics assessment at grade 4 contained 173 items split across 26 booklets. Each item was developed to measure one of five subscales: (a) Number and Operations, (b) Measurement, (c) Geometry, (d) Data Analysis, and (e) Algebra. An item can be multiple-choice or constructed-response. Background (demographic) information is collected on the students through questionnaires that are
filled out by students, teachers, and school administrators. For example, the questionnaires for that assessment collected information on 381 background variables for each student. The above description clearly shows that NAEP's design and implementation are fundamentally different from those of a typical large scale testing program. NAEP reports were originally envisaged as simple lists of percents correct on individual survey items (Mislevy, Johnson, & Muraki, 1992), in the population as a whole and in subpopulations of particular interest. However, it was realized later that this approach has severe limitations (e.g., comparison is limited to groups of items common to student subpopulations), and major features of the detailed results from hundreds of items and hundreds of background variables could not be effectively communicated without some kind of statistical modeling. As a result, starting in 1984, NAEP reporting methods used a statistical model consisting of two components: (a) an item response theory (IRT) model, and (b) a linear regression model (see, e.g., Beaton, 1987; Mislevy et al., 1992). Other large scale educational assessments such as the International Adult Literacy Study (IALS; Kirsch, 2001), the Trends in Mathematics and Science Study (TIMSS; Martin & Kelly, 1996), and the Progress in International Reading Literacy Study (PIRLS; Mullis, Martin, Gonzalez, & Kennedy, 2003) also adopted essentially the same model. This model is referred to as a conditioning model, a multilevel IRT model, or a latent regression model. An algorithm for estimating the parameters of this model is implemented in the MGROUP set of programs, an ETS product. The first component of the model (the IRT part) defines the responses of examinees to a set of cognitive items as dependent on a p-dimensional latent trait/proficiency vector θ = (θ_1, ..., θ_p).
In the second component (the linear regression part), the distribution of the proficiency variable θ is modeled by a multivariate, multiple regression on a set of background/predictor variables, which contain information on the respective students, schools, teachers, etc. In order to compute maximum (marginal) likelihood estimates of the parameters of this latent regression model, the proficiency θ (usually multivariate) has to be integrated out; this makes estimation for the model problematic. Mislevy (1984, 1985) shows that the maximum likelihood estimates can be obtained using an expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). The algorithm requires the values of the
posterior means and the posterior variances for the examinee proficiency parameters. For problems up to two dimensions, the integration is computed using numerical quadrature by the BGROUP version (Beaton, 1987) of the MGROUP program in operational settings in NAEP. For higher dimensions, no numerical integration routine is available operationally (although it may now be possible to perform numerical integration for higher dimensions; work on that is in progress), and an approximation of the integral is used. The CGROUP version of MGROUP, the current operational procedure used in NAEP and other assessments since 1993, is based on the Laplace approximation (Kass & Steffey, 1989), which ignores the higher order derivatives of the examinee posterior distribution and may not provide accurate results, especially in higher dimensions (e.g., a graphical plot for a data example in Thomas, 1993, shows that CGROUP overestimates the high examinee posterior variances). A number of extensions to the CGROUP version of MGROUP, as well as alternative estimation methods, have been suggested. The MCMC estimation approach of Johnson and Jenkins (in press) and Johnson (2002) focuses on implementing a fully Bayesian method for joint estimation of IRT and regression parameters. The direct estimation method for normal populations (Cohen & Jiang, 1999) tries to replace the multiple regression by a model that is essentially equivalent to a multigroup IRT model with some additional restrictions on item parameters and group specific variances. The YGROUP version (von Davier & Yu, 2003) of MGROUP implements seemingly unrelated regressions (SUR; Zellner, 1962) and offers generalized least squares estimation of the latent regression model (von Davier & Yon, 2004). None of these alternatives has been entirely satisfactory.
The MCMC estimation approach could not be implemented for the operational model with multidimensional proficiencies and the full set of background variables because of the time-consuming nature of MCMC estimation. The direct estimation approach as proposed by Cohen and Jiang (1999) leads to biased estimates even in very small data examples with nonnormal marginal ability distributions (von Davier, 2003). The SUR approach to estimating multivariate regressions is identical to generalized least squares (GLS) if all equations have the same regressors and observed dependent variables (Greene, 2002), a condition that cannot be expected to hold in latent regressions. Therefore, there is scope for
further research in this area.

Stochastic EM methods (SEM; Broniatowski, Celeux, & Diebolt, 1983; Celeux & Diebolt, 1985) have been suggested for use in EM algorithms where the E-step is hard to compute or approximate analytically or numerically. In an application of a stochastic EM algorithm, the E-step is handled by simulation. Depending on the nature of the simulation used in the E-step, a stochastic EM method can be a Monte Carlo EM (e.g., Wei & Tanner, 1990), a Markov chain Monte Carlo EM (e.g., McCulloch, 1994), a rejection sampling EM (e.g., Booth & Hobert, 1999), an importance sampling EM (e.g., Booth & Hobert, 1999), etc. Important applications of the algorithm in psychometrics include Fox (2003); Lee, Song, and Lee (2003); Clarkson and Gonzalez (2001); and Meng and Schilling (1996), each of which employs a Markov chain Monte Carlo algorithm (e.g., Gilks, Richardson, & Spiegelhalter, 1996) to simulate in the E-step. The importance sampling EM (e.g., Booth & Hobert, 1999) is a stochastic EM method where each integral in the E-step is approximated by an average over a random sample. This paper examines the application of the importance sampling EM method to fit the latent regression model in NAEP-like settings. We also present a comparison of the results from the importance sampling EM algorithm with those from the CGROUP version of MGROUP, the technique currently used in NAEP. For a low dimensional real data example, the results from the suggested method are also compared to those from the BGROUP version of MGROUP, which performs exact numerical integration and is the gold standard method in such settings. Simulation studies and real data analysis show that the stochastic EM method provides a viable alternative to CGROUP for fitting latent regression models.

2 NAEP Model, Estimation, and the Current MGROUP Method

This section first states the two component statistical model used in NAEP. A brief description of the EM algorithm, which will be helpful later, follows.
The next subsection describes the current NAEP estimation process and the MGROUP program used at ETS.
2.0.1 The Latent Regression Model

NAEP and other similar large scale assessments implement a latent regression model utilizing an IRT measurement model. Assume that the unique p-dimensional latent proficiency vector for examinee i is θ_i = (θ_i1, θ_i2, ..., θ_ip). In NAEP, p could be 2, 3, or 5. Let us denote the response vector to the test items for examinee i as y_i = (y_i1, y_i2, ..., y_ip), where y_ik, a vector of responses, contributes information about θ_ik. The likelihood for an examinee is given by

f(y_i | θ_i) = ∏_{q=1}^{p} f_1(y_iq | θ_iq) ≡ L(θ_i ; y_i).  (1)

The terms f_1(y_iq | θ_iq) above follow a univariate IRT model, usually with three-parameter logistic (3PL) or generalized partial-credit model (GPCM) likelihood terms. For reasons to be discussed later, the dependence of (1) on the item parameters is suppressed. Suppose x_i = (x_i1, x_i2, ..., x_im) are m fully measured demographic and educational characteristics for the examinee. Conditional on x_i, the examinee proficiency vector θ_i is assumed to follow a multivariate normal prior distribution, that is, θ_i | x_i ~ N(Γ'x_i, Σ). The mean parameter matrix Γ and the nonnegative definite variance matrix Σ are assumed to be the same for all examinee groups. Under this setup, L(Γ, Σ | Y, X), the (marginal) likelihood function for (Γ, Σ) based on the data (X, Y), is given by

L(Γ, Σ | Y, X) = ∏_{i=1}^{n} ∫ f_1(y_i1 | θ_i1) ⋯ f_1(y_ip | θ_ip) φ(θ_i | Γ'x_i, Σ) dθ_i,  (2)

where n is the number of examinees, and φ(· | ·, ·) is the multivariate normal density function.

2.0.2 EM Algorithm

The EM algorithm (e.g., Dempster et al., 1977) is commonly used to obtain maximum likelihood estimates or posterior modes in problems with missing data. In fact, the algorithm is applicable in a broad range of situations where probability models can be reexpressed as ones on augmented parameter spaces and the added parameters can be thought of as missing data. The EM algorithm is relevant for mixed models because
the random effects may be viewed as missing data and the algorithm may be used to find maximum likelihood estimates for these models. Suppose that, given an unknown parameter vector ω, the data y follow a distribution f(y; ω) and that y = (y_obs, y_mis), where y_obs is the observed data and y_mis is the missing data. Suppose the objective is to obtain the maximum likelihood estimate of ω, that is, the value of ω that maximizes the observed data likelihood f(y_obs | ω) = ∫ f(y_obs, y_mis | ω) dy_mis. The EM algorithm goes through a succession of E-steps and M-steps starting from an initial value ω^(0) of the parameter vector ω. In the (t + 1)-th E-step, one computes

Q(ω | ω^(t)) = E[ log{f(y_obs, y_mis | ω)} | y_obs, ω^(t) ],  (3)

the expected value of the complete data loglikelihood, where the expectation is with respect to the distribution of y_mis given y_obs and ω^(t). In the corresponding M-step, one maximizes Q(ω | ω^(t)), computed in the preceding E-step, with respect to ω to get ω^(t+1). The algorithm is said to have converged when two successive iterations give very similar values of the parameter vector. It is also important to monitor the value of the loglikelihood f(y_obs | ω). The EM algorithm is guaranteed to converge only to a local maximum, so it is customary to start the iterations at many points in the parameter space and then to compare the values of the likelihood at all of the local maxima. The method may be very slow to converge if the proportion of missing data is high. For simple problems, one may be able to do the averaging in (3) analytically. However, in many practical problems (e.g., in NAEP estimation), the expectation cannot be computed analytically; then one may need to use an approximation technique or a simulation technique to compute the expectation.

2.0.3 NAEP Estimation Process and the MGROUP Program

NAEP uses a three-stage estimation process for fitting the above mentioned latent regression model and making inferences. The first stage, scaling, fits a simple IRT
model (consisting of 3PL and GPCM terms) of the form in (1) to the examinee response data and estimates the item parameters. The prior distribution used in this step is not θ_i | x_i ~ N(Γ'x_i, Σ) as described above, but rather θ_i ~ N(0, I); that is, the subscales are assumed to be independent a priori. The second stage, conditioning, assumes that the item parameters are fixed at the estimates found in scaling and fits the model in (2) to the data; that is, it estimates Γ and Σ as a first part. In the second part of the conditioning step, plausible values for all examinees are obtained using the parameter estimates obtained in scaling and the first part of conditioning; the plausible values are used to estimate examinee subgroup averages. The third stage of the NAEP estimation process, called variance estimation, estimates the variances corresponding to the examinee subgroup averages using a jackknife approach (see, e.g., Johnson & Jenkins, in press). Our research will focus on the conditioning step and assume that the scaling has already been done (i.e., the item parameters are fixed); this is the reason we suppress the dependence of (1) on the item parameters. Because we will be concerned with the conditioning step, the remaining part of the section provides a more detailed discussion of it. The first objective of this step is to estimate Γ and Σ from the data. If the θ_i's were known, the maximum likelihood estimators of Γ and Σ would be

Γ̂ = (X'X)⁻¹ X' (θ_1, θ_2, ..., θ_n)',  (4)

Σ̂ = (1/n) Σ_i (θ_i − Γ̂'x_i)(θ_i − Γ̂'x_i)'.  (5)

However, the θ_i's are actually unknown. Mislevy (1984, 1985) shows that the maximum likelihood estimates of Γ and Σ under unknown θ_i's can be obtained using an EM algorithm (Dempster et al., 1977). The EM algorithm iterates through a number of expectation steps (E-steps) and maximization steps (M-steps). The expression for (Γ_{t+1}, Σ_{t+1}), the updated
value of the parameters in the t-th M-step, is obtained as

Γ_{t+1} = (X'X)⁻¹ X' (θ̄_{1t}, θ̄_{2t}, ..., θ̄_{nt})',  (6)

Σ_{t+1} = (1/n) [ Σ_i Var(θ_i | X, Y, Γ_t, Σ_t) + Σ_i (θ̄_{it} − Γ'_{t+1} x_i)(θ̄_{it} − Γ'_{t+1} x_i)' ],  (7)

where θ̄_{it} = E(θ_i | X, Y, Γ_t, Σ_t) is the posterior mean for examinee i given the preliminary parameter estimates of iteration t. The process is repeated until convergence of the estimates Γ and Σ. Formulae (6) and (7) require the values of the posterior means E(θ_i | X, Y, Γ_t, Σ_t) and the posterior variances Var(θ_i | X, Y, Γ_t, Σ_t) for the examinees. Correspondingly, the t-th E-step computes these two required quantities for all the examinees. The MGROUP set of programs at ETS performs the EM algorithm mentioned above. The MGROUP program consists of two primary controlling routines called PHASE1 and PHASE2. The former does some preliminary processing while the latter directs the EM iterations. There are different versions of the MGROUP program depending on the method used to perform the E-step in PHASE2: BGROUP using numerical quadrature, NGROUP (Mislevy, 1985) using Bayesian normal theory, CGROUP (Thomas, 1993) using Laplace approximations, YGROUP (von Davier & Yu, 2003) using SUR, and so on. The BGROUP version of MGROUP, applied when the dimension of θ_i is less than or equal to two, applies numerical quadrature to approximate the integral. When the dimension of θ_i is larger than two, NGROUP, CGROUP, or YGROUP may be used. Note that NGROUP and YGROUP are not used for operational purposes for the following reasons: NGROUP assumes normality of both the likelihood and the prior and may produce biased results if these assumptions are not met (Thomas, 1993). YGROUP performs the estimation independently by implementing seemingly unrelated regressions (Zellner, 1962), which is useful for generating starting values, but does not capitalize on the covariances between the p components of the latent trait.
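As a concrete illustration, the M-step updates (6) and (7) can be sketched in a few lines of code. This is a minimal sketch rather than the MGROUP implementation; the function name, array names, and shapes are assumptions made only for this example.

```python
import numpy as np

def m_step(X, theta_bar, post_var):
    """One M-step of the latent regression EM, following (6) and (7).

    X         : (n, m) matrix of background variables
    theta_bar : (n, p) posterior means E(theta_i | X, Y, Gamma_t, Sigma_t)
    post_var  : (n, p, p) posterior covariances Var(theta_i | X, Y, Gamma_t, Sigma_t)
    """
    n = X.shape[0]
    # (6): Gamma_{t+1} = (X'X)^{-1} X' (posterior means stacked as rows)
    Gamma = np.linalg.solve(X.T @ X, X.T @ theta_bar)       # (m, p)
    # (7): average of the posterior variances plus the outer products
    # of the regression residuals evaluated at Gamma_{t+1}
    resid = theta_bar - X @ Gamma                           # (n, p)
    Sigma = (post_var.sum(axis=0) + resid.T @ resid) / n    # (p, p)
    return Gamma, Sigma
```

As a sanity check, if the posterior means reproduce the regression exactly and the posterior variances are zero, the update recovers Γ exactly and yields a zero Σ, as (6) and (7) suggest.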
Ignoring these covariances may be inappropriate, for example, if the latent trait θ_i is measured with very different precisions across components θ_iq for some or all subjects.

The following subsection provides details about CGROUP, the approach currently used operationally in several large scale assessments including NAEP. This approach is based on the approximation of the posterior moments of a multivariate distribution using the Laplace approximation.

E-step. The CGROUP version of MGROUP uses the Laplace approximation of the posterior mean and variance, that is, uses

E(θ_j | X, Y, Γ_t, Σ_t) ≈ θ̂_{j,mode} − (1/2) Σ_{r=1}^{p} G_{jr} G_{rr} ĥ_r^(3),  j = 1, 2, ..., p,  (8)

Cov(θ_j, θ_k | X, Y, Γ_t, Σ_t) ≈ G_{jk} − (1/2) Σ_{r=1}^{p} G_{jr} G_{kr} G_{rr} ĥ_r^(4)
  + Σ_{r=1}^{p} Σ_{s=1}^{p} {[1 − (1/2) I(r = s)] ĥ_r^(3) ĥ_s^(3) G_{rs} (G_{js} G_{ks} G_{rr} + G_{js} G_{kr} G_{rs} + G_{jr} G_{ks} G_{rs} + G_{jr} G_{kr} G_{ss})},  j, k = 1, 2, ..., p,  (9)

where

θ̂_{j,mode} = the j-th component of the posterior mode of θ,
h = log[f(y | θ) φ(θ | Γ'_t x, Σ_t)],
G_{jk} = the (j, k)-th element of ( ∂²h/∂θ_r∂θ_s |_{θ̂_mode} )⁻¹,  (10)
ĥ_r^(n) = the n-th pure partial derivative of h with respect to θ_r, evaluated at θ̂_mode,

and I(r = s) is 1 if r = s and 0 otherwise. Details about formulae (8) and (9) can be found in Thomas (1993). The Laplace method does not provide an unbiased estimate of the quantity it is approximating and may provide inaccurate results if the higher order derivatives (that the Laplace method assumes to be equal to zero) are not negligible. The error of approximation for each component of the mean and covariance of θ_i is of order O(1/k²) (e.g., Kass & Steffey, 1989), where k is the number of items measuring the skill corresponding to the component. Because the number of items given to each examinee in large scale assessments such as
NAEP is not too large (making k rather small), the errors in the Laplace approximation may become nonnegligible, especially for high-dimensional θ_i's. Further, if the posterior distribution of the θ_i's is multimodal (which is not impossible, especially for a small number of items), the method can perform poorly. Therefore, the CGROUP version of MGROUP is not entirely satisfactory. Figure 1 in Thomas (1993), where the posterior variance estimates of 500 randomly selected examinees using the Laplace method and exact numerical integration for two-dimensional θ_i are plotted, shows that the Laplace method provides inflated variance estimates for examinees with large posterior variance (see also Section 6 in this report). The departure may be more severe for θ_i's in higher dimensions.

3 The Suggested Stochastic EM Method

Before going into the details of the suggested technique, we provide a brief description of the importance sampling method (e.g., Geweke, 1989).

3.1 Importance Sampling

The Radon-Nikodym theorem (e.g., Bauer, 1972) forms the basis for importance sampling. This theorem states that a measure M over Ω can be expressed in terms of another measure N over Ω as

M(A) = ∫_A f dN, that is, ∫_A m(ω) dω = ∫_A n(ω) f(ω) dω,

provided N(A) = 0 ⇒ M(A) = 0 for all A. The function f is referred to as the Radon-Nikodym derivative and is sometimes written as f = dM/dN. In the case that m and n are probability densities over Ω = R^p, the theorem implies

∫_Ω ω f(ω) n(ω) dω = ∫_Ω ω m(ω) dω,

which is equivalent to E_M(ω) = E_N(fω). This result is applied in importance sampling to approximate intractable integrals. Suppose we want to compute the expected value of a function g(ω), where the expectation is taken with respect to a probability density f(ω). We can express the expectation of g(ω) as

E_f{g(ω)} = ∫ g(ω) f(ω) dω = ∫ [g(ω) f(ω) / h(ω)] h(ω) dω  (11)
for any probability density h(ω) defined on the same sample space that is zero only if f(ω) is zero. Now, if we can generate a random sample ω̃_1, ω̃_2, ..., ω̃_n from the distribution with density h(ω), we can approximate E_f{g(ω)}, using (11), as

E_f{g(ω)} ≈ (1/n) Σ_{i=1}^{n} g(ω̃_i) f(ω̃_i) / h(ω̃_i).  (12)

The standard error corresponding to the above estimate is given by dividing the standard deviation of the importance ratios g(ω̃_i) f(ω̃_i)/h(ω̃_i) by √n. This standard deviation determines the quality of the approximation. The density h(ω) is called the importance sampling density. If h(ω) is chosen so that the importance ratio g(ω)f(ω)/h(ω) is roughly constant over the range of possible values of ω (which will make the standard deviation of the importance ratios small), a fairly accurate approximation of the integral may be obtained. In particular, h(ω) should have heavier tails than the product g(ω)f(ω), since otherwise there may be a few ω̃_i's for which h(ω̃_i) is much smaller than g(ω̃_i)f(ω̃_i), and the estimate will blow up. The Radon-Nikodym theorem requires N(A) = 0 ⇒ M(A) = 0 to hold, which can be stated in an approximate fashion as: wherever h is zero, gf needs to be zero. This is the theoretical justification for the heavier tail requirement of h over gf. The main advantages of importance sampling are that it provides an unbiased estimate of the integral and that the accuracy can be monitored by examining the standard deviation of the g(ω̃_i)f(ω̃_i)/h(ω̃_i) values. In some situations, the researcher knows a distribution only up to a constant, but still needs to compute the moments of the distribution. For example, often all that is available in a Bayesian analysis is a multiple of the posterior (obtained by multiplying the likelihood and the prior), but not the posterior itself. Importance sampling can be used in this situation as well. Suppose, in the above setup, one does not know f(ω), but does know q(ω) ≡ c f(ω) for some unknown constant c.
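As a small numerical illustration of (12) and its standard error, the following sketch estimates E_f{g(ω)} for toy choices f = N(0, 1) and g(ω) = ω², so the true value is 1. The Student-t importance density with 4 degrees of freedom is an assumption made here for the example, chosen because it has heavier tails than f.

```python
import numpy as np

rng = np.random.default_rng(42)

def norm_pdf(w):
    # standard normal density, the target f
    return np.exp(-0.5 * w**2) / np.sqrt(2 * np.pi)

def t4_pdf(w):
    # Student-t density with 4 degrees of freedom, the importance density h
    from math import gamma
    nu = 4.0
    c = gamma((nu + 1) / 2) / (np.sqrt(nu * np.pi) * gamma(nu / 2))
    return c * (1 + w**2 / nu) ** (-(nu + 1) / 2)

n = 100_000
w = rng.standard_t(df=4, size=n)             # draws from h
ratios = (w**2) * norm_pdf(w) / t4_pdf(w)    # g(w) f(w) / h(w)
estimate = ratios.mean()                     # eq. (12); close to the true value 1
std_err = ratios.std(ddof=1) / np.sqrt(n)    # SD of the ratios over sqrt(n)
```

Monitoring std_err, the standard deviation of the importance ratios divided by √n, indicates how accurate the approximation is.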
The expectation of a function g(ω) is then given by

E_f{g(ω)} = ∫_ω g(ω) q(ω) dω / ∫_ω q(ω) dω.
The numerator on the right-hand side above is approximated by

∫ g(ω) q(ω) dω = ∫ [g(ω) q(ω) / h(ω)] h(ω) dω ≈ (1/n) Σ_{i=1}^{n} g(ω̃_i) q(ω̃_i) / h(ω̃_i),

while the denominator is approximated by

(1/n) Σ_{i=1}^{n} q(ω̃_i) / h(ω̃_i)

(see, e.g., Gelman, Carlin, Stern, & Rubin, 2003). Geweke (1989) shows that the resulting estimate converges almost surely to the estimand under weak assumptions, and that a central limit theorem (establishing that n^{1/2}(Ê_f{g(ω)} − E_f{g(ω)}) → N(0, τ²), where τ² can be estimated consistently) applies under stronger assumptions. The estimation of the ratio of two quantities by the ratio of estimates of the quantities is known as ratio estimation. For more details about ratio estimation, see, for example, Cochran (1977).

3.2 Importance Sampling in the E-step of the MGROUP Program

This paper suggests approximating the posterior expectation and variance of the examinee proficiencies θ_i in (6) and (7) using the importance sampling method. The posterior distribution of θ_i, denoted as p(θ_i | X, Y, Γ_t, Σ_t), is given by

p(θ_i | X, Y, Γ_t, Σ_t) ∝ f_1(y_i1 | θ_i1) ⋯ f_1(y_ip | θ_ip) φ(θ_i | Γ'_t x_i, Σ_t),  (13)

using (2). The proportionality constant in (13) is a function of y_i, Γ_t, and Σ_t. Let us denote

q(θ_i | X, Y, Γ_t, Σ_t) ≡ f_1(y_i1 | θ_i1) ⋯ f_1(y_ip | θ_ip) φ(θ_i | Γ'_t x_i, Σ_t).  (14)

We drop the subscript i for convenience for the rest of the section and let θ denote the proficiency of an examinee. We have to compute the mean and variance of p(θ | X, Y, Γ_t, Σ_t), that is, the quantities

E(θ | X, Y, Γ_t, Σ_t) ≡ ∫ θ p(θ | X, Y, Γ_t, Σ_t) dθ, and  (15)

Var(θ | X, Y, Γ_t, Σ_t) ≡ ∫ (θ − E(θ | X, Y, Γ_t, Σ_t))(θ − E(θ | X, Y, Γ_t, Σ_t))' p(θ | X, Y, Γ_t, Σ_t) dθ.  (16)
Using ideas from the above description of the importance sampling method, if we can generate a random sample θ̃_1, θ̃_2, ..., θ̃_n from a distribution h(θ) approximating p(θ | X, Y, Γ_t, Σ_t) reasonably well, we can approximate (15) by the ratio of

(1/n) Σ_{j=1}^{n} θ̃_j q(θ̃_j | X, Y, Γ_t, Σ_t) / h(θ̃_j)

and

(1/n) Σ_{j=1}^{n} q(θ̃_j | X, Y, Γ_t, Σ_t) / h(θ̃_j).

Similarly, we can approximate (16) as the ratio of

(1/n) Σ_{j=1}^{n} (θ̃_j − Ê(θ | X, Y, Γ_t, Σ_t))(θ̃_j − Ê(θ | X, Y, Γ_t, Σ_t))' q(θ̃_j | X, Y, Γ_t, Σ_t) / h(θ̃_j)

and

(1/n) Σ_{j=1}^{n} q(θ̃_j | X, Y, Γ_t, Σ_t) / h(θ̃_j),

where Ê(θ | X, Y, Γ_t, Σ_t) denotes the importance sampling estimate of the posterior mean. Following Booth and Hobert (1999), who apply the importance sampling EM method for fitting generalized linear mixed models, we use a multivariate t importance sampling density. Booth and Hobert (1999) argue that, in general, a multivariate t density is a better choice as an importance sampling density than a normal density. One reason is that the former is heavy-tailed while the latter is not, and a good importance sampling density should be heavy-tailed when the density it approximates may be so; also, the t density is easy to sample from. The degrees of freedom of the t density are taken to be four because low degrees of freedom ensure that the t density has heavy tails. (The choice of optimal degrees of freedom is an issue requiring further investigation.) The mean and variance of the t importance sampling density are taken as input from a nonstochastic version of MGROUP. In our implementation, the YGROUP version (von Davier & Yu, 2003) is used, which conveniently provides input for the importance sampling density in the stochastic EM algorithm. It is also possible to estimate variances of the parameter estimates obtained from the EM algorithm as suggested by, for example, Booth and Hobert (1999). However, we are
more interested in the estimation of the latent regression parameters in this work and do not delve into that issue here. Note that a number of studies have successfully applied other forms of stochastic EM methods, such as rejection sampling EM (Booth & Hobert, 1999) and Markov chain Monte Carlo EM, or MCMC-EM (e.g., McCulloch, 1994), but we do not explore those approaches, mainly because of the difficulties in their application to our problem.

4 MCEM in MGROUP

This section describes the implementation of MCEM (short for Markov chain Monte Carlo EM), the version of MGROUP that uses importance sampling to generate the necessary statistics in the E-step. Because of the presence of a stochastic component in the E-step, this technique is also called a stochastic EM algorithm. Following the notes on the implementation, differences between assessing convergence in nonstochastic and in stochastic optimization algorithms will be discussed.

4.1 Implementing MCEM

The MCEM algorithm was integrated into YGROUP (von Davier, 2003), which is based on BGROUP and implements SUR (Zellner, 1962) and various other modifications. Following Booth and Hobert (1999), importance sampling is used in the E-step to generate the statistics (here, posterior means, residual components, and measurement error components) necessary to perform the EM iterations for the latent regression. The implementation uses importance sampling employing a multivariate t density (as in Booth & Hobert, 1999). The MCEM version as implemented in YGROUP uses the SUR posterior means and the posterior residual (co-)variance matrix as fast and efficient means to compute the mean and variance of the importance sampling density. As in all versions of MGROUP, optional starting values for the latent regression (Γ, Σ) can be provided but are not essential. As an alternative, initial iterations with a nonstochastic E-step of one of the other versions can be carried out to generate starting values.
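The E-step just described, drawing from a multivariate t with 4 degrees of freedom whose location and scale come from a nonstochastic (SUR-style) pass and forming self-normalized importance-sampling estimates of the posterior mean and variance, might look like the following sketch. The function and argument names are illustrative assumptions, not the MGROUP code.

```python
import numpy as np

def e_step_moments(log_q, mu, cov, df=4, size=6000, rng=None):
    """Approximate one examinee's posterior mean and variance by
    self-normalized importance sampling, as in Section 3.2.

    log_q : function mapping a (size, p) array of theta draws to the log
            of the unnormalized posterior q(theta | X, Y, Gamma_t, Sigma_t)
    mu,cov: location and scale of the multivariate-t importance density,
            e.g. taken from a nonstochastic (SUR/YGROUP-style) pass
    """
    rng = np.random.default_rng() if rng is None else rng
    p = len(mu)
    # draw from a multivariate t_df(mu, cov), built as a normal scale mixture
    z = rng.multivariate_normal(np.zeros(p), cov, size=size)
    u = rng.chisquare(df, size=size)
    theta = mu + z * np.sqrt(df / u)[:, None]
    # log density of the multivariate t at the draws
    from numpy.linalg import slogdet, solve
    from math import lgamma
    d = theta - mu
    maha = np.einsum('ij,ij->i', d, solve(cov, d.T).T)
    _, logdet = slogdet(cov)
    log_h = (lgamma((df + p) / 2) - lgamma(df / 2)
             - 0.5 * (p * np.log(df * np.pi) + logdet)
             - 0.5 * (df + p) * np.log1p(maha / df))
    # self-normalized importance weights proportional to q / h
    log_w = log_q(theta) - log_h
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    mean = w @ theta
    centered = theta - mean
    var = (w[:, None] * centered).T @ centered
    return mean, var
```

For any target whose unnormalized log density is available, the routine returns moment estimates whose Monte Carlo error shrinks as the importance sample grows; the unknown normalizing constant of q cancels in the self-normalized weights, exactly as in the ratio estimation described above.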
The initial sample size of the importance sample can be chosen to be much smaller
than the final sample size, in order to speed up initial iterations, where individual accuracy is less important than approaching a good starting value for the aggregate parameters. The examples below use 500 as the initial importance sample size and 6,000 as the final importance sample size.

4.2 Determining Convergence of the MCEM Algorithm

When applying a stochastic EM algorithm, it is possible that the EM algorithm has not converged but appears to have done so after an EM step (i.e., updated parameters and likelihoods do not change much from the previous step) because of an unlucky random sample. One should therefore use a stricter convergence criterion, or a larger variety of criteria that all have to be met simultaneously, in order to ensure the convergence of the algorithm. We monitor the likelihood function and the parameter vector for convergence and conclude that convergence has occurred only when the relative changes in both of these quantities are less than ε in five successive iterations. The likelihood increases over the iterations in a nonstochastic EM algorithm (see, e.g., Dempster et al., 1977), while this may not be so for a stochastic EM algorithm because of Monte Carlo error. In order to assess convergence of stochastic optimization algorithms, the inevitable nonmonotonicity of these algorithms has to be taken into account. Each E-step in our implementation employs a different set of points (the importance sample) in the multivariate quadrature grid in order to perform the integration. This means that the likelihood of each observed response vector may vary, and hence the likelihood of the whole sample may vary, even if all other parameters are unchanged, from one importance sample to the next. Therefore, in our implementation, we monitor absolute changes in the log likelihood, the regression parameters Γ, and the residual (co-)variance matrix Σ simultaneously.
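A sketch of this three-way monitoring, with illustrative function names and user-defined bounds, might look as follows:

```python
import numpy as np

def max_abs_changes(loglik, loglik_prev, Gamma, Gamma_prev, Sigma, Sigma_prev):
    """Absolute change in the log likelihood and maximum absolute changes
    in the regression parameters Gamma and the residual (co-)variance
    matrix Sigma between two successive EM cycles."""
    return (abs(loglik - loglik_prev),
            float(np.max(np.abs(Gamma - Gamma_prev))),
            float(np.max(np.abs(Sigma - Sigma_prev))))

def stop_now(changes, bounds):
    # stop if, and only if, all three changes are below their bounds
    return all(c < b for c, b in zip(changes, bounds))
```

Because a single lucky draw can make all three changes small at once, a rule of this form is typically combined with a memory of previous changes, as described next.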
The maximum absolute changes are checked against user-defined stopping criteria, which are used to decide whether a significant change has occurred that demands further iterations towards the optimum. The algorithm stops if, and only if, all three absolute changes are smaller than the user-defined bounds. In addition, the absolute changes of previous iterations are combined with the current change by the following approach: For cycle t, define the average absolute maximum change (AAMC) as

AAMC_{x,t} = p · AMC_{x,t} + (1 − p) · AAMC_{x,t−1},

which averages the current absolute maximum change (AMC) and the previous average absolute maximum change (AAMC). Here, x denotes a parameter (vector) or a function of parameters (for example, the maximum change in the regression parameters in our case); t denotes the current iteration (cycle) of the algorithm and t − 1 the previous cycle; and 0 ≤ (1 − p) ≤ 1.0 is the weight given to the averaged previous cycles in the criterion. This criterion ensures that, if p < 1, more than the current change (which might be small due to the stochastic nature of the algorithm) is taken into account when deciding whether to stop iterating. If p = 1, we have a regular (no-memory) stopping criterion, which stops whenever the current iteration alone meets the stopping rule. In the current implementation, the stopping rule fades out the past absolute changes rather slowly. We found that p = 0.4, together with the chosen AAMC bound on the relative changes in the likelihood, the regression weights, and the variances, works reasonably well; this choice balances endless iteration against premature termination of the algorithm in the examples presented here. The rule for stopping the algorithm is that AAMC_{x,t} must fall below this bound for termination, a maximum-change bound similar to what Booth and Hobert (1999) report.

5 Analysis of Simulated Data

This section presents the first proof-of-concept example. The simulated data set used for this example matches the structure and size of the 2000 NAEP mathematics assessment at grade 4 and comes from a previous study (von Davier & Yu, 2003). The assessment has five scales: (a) Number and Operations, (b) Measurement, (c) Geometry, (d) Data Analysis, and (e) Algebra.
This section applies the stochastic EM method to find the parameter estimates for the simulated data set, which involves responses of 13,511 students. Each student responds to 1 of 26 sets of items (booklets). Each booklet contains three blocks of items (both multiple-choice and constructed-response) out of a total of 145 items. Each examinee has information on 381 predictors that are used in the latent regression model. This section compares the MCEM-GROUP and the operational CGROUP estimates of the posterior means and standard deviations of examinees, the regression parameters Γ, and the residual variance-covariance matrix Σ. As an additional benchmark, the results of both CGROUP and MCEM-GROUP are compared against the generating values, that is, the individual latent trait vectors used to simulate the data.

5.1 Estimates of Γ and Σ for the Simulated Data

Figure 1 shows a comparison of the CGROUP and MCEM-GROUP regression parameter estimates (i.e., estimates of the individual parameters in Γ) for the 2000 NAEP mathematics assessment at grade 4 by subscale. The figure shows that the estimates from the two approaches are extremely close. Table 1 shows the residual variance-covariance estimates (i.e., estimates of the individual parameters in Σ) from CGROUP and MCEM-GROUP for the simulated data set. The table shows slightly larger estimated variances for the MCEM version of MGROUP as compared to CGROUP. The pattern is the opposite for the estimated covariances. The differences between the covariance estimates of CGROUP and MCEM-GROUP seem smaller than the differences between the variance estimates of the two approaches.

5.2 Subgroup Estimates for the 2000 NAEP Mathematics Assessment at Grade 4

Figures 2 to 4 present recovery plots for CGROUP and MCEM-GROUP. The figures also show subplots comparing the individual means and standard deviations from CGROUP and MCEM-GROUP. Each subscale has four subplots. The recovery plots are based on individual examinee statistics and hence will show more variance than is present in the population parameters Γ and Σ. In the recovery plots, the generating values are the reference, and the conditional posterior moments (given the responses, the background information, and the maximum likelihood estimates of Γ and Σ, not the random draws
of these parameters used when generating imputations with MGROUP) as generated by CGROUP and MCEM-GROUP are plotted against these reference values. For all subscale displays, the ideal recovery or agreement line, respectively, is indicated by the main diagonal given in each plot.

Figure 1. Regression parameter (Γ) estimates from CGROUP and MCEM-GROUP for the 2000 NAEP mathematics assessment at grade 4, by subscale (Number & Operations, Measurement, Geometry, Data Analysis, Algebra).

Figures 2 to 4 show no obvious differences between CGROUP and MCEM-GROUP when comparing the plots showing recovery of the generating θ values. This observation finds additional support when examining the plot of the two sets of posterior mean estimates against each other. The posterior means generated by CGROUP and MCEM-GROUP fall nicely around the diagonal line, and the plot is much narrower than the truth-recovery plots. This indicates that CGROUP and MCEM-GROUP agree quite well in producing posterior means, a statement that cannot be made for the posterior standard deviations.
Figure 2. Posterior moments from CGROUP and MCEM-GROUP compared against the generating values, and the posterior means and standard deviations from CGROUP and MCEM-GROUP compared against each other (Number & Operations and Measurement subscales).
Figure 3. Posterior moments from CGROUP and MCEM-GROUP compared against the generating values, and the posterior means and standard deviations from CGROUP and MCEM-GROUP compared against each other (Geometry and Data Analysis subscales).
Table 1. Residual Variance-Covariance Estimates From CGROUP and MCEM-GROUP
(Scales: NUM&OPER, MEASURMT, GEOMETRY, DATA ANL, ALGEBRA; one 5 × 5 matrix of CGROUP estimates and one of MCEM-GROUP estimates.)
Note. Based on the operational population model for the 2000 NAEP mathematics assessment at grade 4.

The posterior standard deviations given by MCEM-GROUP are considerably larger than those given by CGROUP for all subscales. It remains to be determined whether this is a systematic difference, whether it is due to Monte Carlo inflation of the variance, or whether it depends on some other structural variable such as the correlation between scales or the sample size. Table 2 compares the subgroup means from CGROUP and MCEM-GROUP for the relevant subgroups; there seems to be little difference between the two methods in this respect as well. The Table 2 results show that MCEM-GROUP does not perform much differently from CGROUP in reproducing group-level statistics such as subgroup means and variances.
Figure 4. Posterior moments from CGROUP and MCEM-GROUP compared against the generating values, and the posterior means and standard deviations from CGROUP and MCEM-GROUP compared against each other (Algebra subscale).

6 Analysis of Real NAEP Data: 2003 Reading Assessment at Grade 4

The 2003 NAEP reading assessment at grade 4 has data from 187,581 examinees who are in the fourth grade or 9 years of age. Each student answers a literary block (to assess the ability to read for literary experience) and an informative block (to assess the ability to read for information). Each block consists of 10 to 12 items, with a mixture of multiple-choice and constructed-response items. We will refer to the two skills (subscales) measured by the assessment as literary and information. The combined item sample for these two scales has 111 items in total; 688 background variables are used in the latent regression model. Because the θ_i's are two-dimensional, it is possible to apply BGROUP (which performs exact numerical integration in the E-step) here. BGROUP provides the gold standard, and we will compare the results obtained from the application of the stochastic EM method against it. Comparison with results from CGROUP will be provided as well.
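For a two-dimensional θ, the exact E-step integration that BGROUP performs can be illustrated with a tensor-product quadrature grid. The sketch below is an assumption-laden toy, not the actual BGROUP implementation: the normal "likelihood", the function names, and the N(μ, σ²) prior standing in for the latent regression contribution are all illustrative.

```python
import numpy as np

def posterior_mean_quadrature(loglik_fn, mu, sigma, n_nodes=31):
    """Posterior mean of a 2-D latent trait by direct numerical quadrature.
    The prior contribution is N(mu, diag(sigma**2)); loglik_fn returns the
    log-likelihood of the observed responses at each grid point."""
    # Gauss-Hermite nodes/weights for the probabilists' weight exp(-x^2/2),
    # rescaled so the grid covers the prior N(mu, sigma^2).
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    t1 = mu[0] + sigma[0] * nodes
    t2 = mu[1] + sigma[1] * nodes
    g1, g2 = np.meshgrid(t1, t2, indexing="ij")
    grid = np.stack([g1.ravel(), g2.ravel()], axis=1)
    w = np.outer(weights, weights).ravel()
    # Posterior weights = prior quadrature weights times the likelihood.
    post = w * np.exp(loglik_fn(grid))
    post /= post.sum()
    return post @ grid

# Toy normal "likelihood" centered at (1, 0): with an N(0, I) prior, the
# posterior mean is pulled halfway, to (0.5, 0).
loglik = lambda th: -0.5 * np.sum((th - np.array([1.0, 0.0])) ** 2, axis=1)
mean = posterior_mean_quadrature(loglik, mu=np.array([0.0, 0.0]),
                                 sigma=np.array([1.0, 1.0]))
```

The cost of such a grid grows as (number of nodes)^p, which is why exact integration is practical only for p ≤ 2 and stochastic integration becomes attractive for the five-scale mathematics assessment.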
Table 2. Comparison of Subgroup Estimates From CGROUP and MCEM-GROUP
(Columns: NumOp, Meas, Geom, DA, Alg for each of CGROUP and MCEM-GROUP; rows: Overall, Male, Female, White, Black, Hispanic, Asian, Amerind.)

6.1 Estimates of Γ and Σ for the Real Data

Table 3 shows the residual variance estimates Σ̂ as generated by BGROUP, CGROUP, and MCEM-GROUP for the NAEP data.

Table 3. Residual Variances, Covariances, and Correlations for the 2003 NAEP Reading Assessment at Grade 4 Data
(Columns: Lit. and Info. for each of BGROUP, CGROUP, and MCEM-GROUP; rows: Lit., Info.)
Note. Lit. = Literary, Info. = Information. Residual variances = main diagonals, covariances = upper off-diagonals, and correlations = lower off-diagonals.

The three sets of estimates of the residual correlation and the residual variances are close. Interestingly, the residual variance estimates generated by MCEM-GROUP are slightly smaller than the corresponding
CGROUP estimates for the reading data, whereas the residual variance estimates for the simulated math data were larger for MCEM-GROUP than for CGROUP. As a consequence, the residual correlation is higher for MCEM-GROUP than for CGROUP. Whether this is an effect of the sample size or of the number of scales needs further investigation.

Figure 5 shows the regression parameter estimates as generated by CGROUP and MCEM-GROUP plotted against the estimates generated by the gold standard for this data example, BGROUP.

Figure 5. Regression coefficients for the 2003 NAEP reading assessment at grade 4 (CGROUP and MCEM-GROUP estimates plotted against BGROUP, for the literary and information scales).

The figure shows that the 688 regression coefficient estimates each for the literary and the information scales agree to a large extent between the different versions of MGROUP. Almost all points fall on the main diagonal, which indicates that the Γ̂ estimates produced by BGROUP, CGROUP, and MCEM-GROUP are virtually identical.

6.2 Subgroup Estimates for the 2003 NAEP Reading Assessment at Grade 4

Figure 6 presents plots of the posterior means as generated by CGROUP and MCEM-GROUP compared to the posterior means generated by BGROUP.

Figure 6. Posterior means as generated by CGROUP and MCEM-GROUP and plotted against the reference, BGROUP, for the NAEP 2003 grade 4 reading data (literary and information scales).

Figure 7 compares the corresponding posterior standard deviations. In contrast to the first example (the one involving simulated data), there is no obvious large difference between the estimates of the posterior means or standard deviations (SDs) generated by CGROUP and MCEM-GROUP. Both approaches differ slightly from the estimates generated by BGROUP. CGROUP overestimates the posterior standard deviation for larger values when compared to BGROUP; the graph comparing BGROUP and CGROUP posterior SDs shows this effect in a way very similar to what Thomas (1993) reports in his Figure 1. MCEM-GROUP, on the other hand, stays closer to BGROUP for large posterior standard deviations than CGROUP does. For small standard deviations, MCEM-GROUP seems to produce slightly larger values than BGROUP, while the estimates generated by CGROUP and BGROUP agree better for smaller standard deviations. Except for a few outliers, the posterior means as generated by MCEM-GROUP agree nicely with the values produced by both BGROUP and CGROUP. There is no obvious indication of systematic over- or underestimation for the vast majority of subjects.

Figure 7. Posterior standard deviations as generated by CGROUP and MCEM-GROUP and plotted against the reference, BGROUP, for the 2003 NAEP reading assessment at grade 4 data (literary and information scales).

The obvious outliers seen in these plots are the subject of ongoing research aimed at optimizing and stabilizing the current implementation of MCEM-GROUP. It will be investigated whether the stochastic integration failed to produce meaningful results because of poor coverage of the true posterior by the importance sampling distribution, because of the small size of the importance sample, or for other, less obvious reasons. Table 4 shows the subgroup estimates and the corresponding standard deviations (in parentheses) provided by the three methods. There is hardly any difference between CGROUP and MCEM-GROUP as far as subgroup estimates are concerned. However, when compared to the BGROUP results, both CGROUP and MCEM-GROUP slightly underestimate the subgroup means.
Table 4. Comparison of Subgroup Estimates From BGROUP, CGROUP, and MCEM-GROUP

Pop. sub.   BGROUP Lit.  BGROUP Info.  CGROUP Lit.  CGROUP Info.  MCEM Lit.  MCEM Info.
Overall     (0.95)       (0.98)        (0.96)       (0.99)        (0.94)     (0.98)
Male        (0.96)       (1.00)        (0.96)       (1.01)        (0.94)     (0.99)
Female      (0.94)       (0.97)        (0.94)       (0.98)        (0.93)     (0.96)
White       (0.87)       (0.89)        (0.87)       (0.90)        (0.86)     (0.89)
Black       (0.92)       (0.92)        (0.92)       (0.93)        (0.91)     (0.91)
Hisp.       (0.94)       (0.94)        (0.94)       (0.95)        (0.93)     (0.93)
Asian       (0.93)       (0.97)        (0.94)       (0.98)        (0.92)     (0.96)
AmInd       (0.94)       (0.94)        (0.95)       (0.95)        (0.92)     (0.93)
Public      (0.96)       (0.99)        (0.96)       (1.00)        (0.94)     (0.98)
Private     (0.83)       (0.87)        (0.83)       (0.88)        (0.82)     (0.86)

Note. Pop. sub. = Population subgroup, Lit. = Literary, Info. = Information, Hisp. = Hispanic, Asian = Asian Pacific, AmInd = American Indian. Standard deviations are shown in parentheses.

7 Conclusions and Future Work

CGROUP is the current operational method used in large-scale assessments such as NAEP. Though CGROUP provides more accurate results than its predecessor (NGROUP), it is not without problems, as demonstrated by Thomas (1993); in particular, CGROUP is found to inflate variance estimates for examinees with large posterior variances. As of now, there is no entirely satisfactory alternative to CGROUP. As this work shows, a stochastic EM method using importance sampling provides a viable alternative. Application of the importance sampling EM method to a simulated data set and a real data set reveals interesting facts. The regression parameter estimates are extremely close for BGROUP (the gold standard), CGROUP (the current operational version of MGROUP used in NAEP, etc.), and the newly proposed MCEM-GROUP. The estimates of conditional posterior means seem
to be quite close for CGROUP and MCEM-GROUP. In the real data example, both of these sets of estimates are very close to the BGROUP estimates. The MCEM algorithm produced a few outlying estimates of the conditional posterior means, especially in the real data example, which may be due to Monte Carlo error; this issue needs further investigation. Clear statements about the agreement between posterior standard deviations cannot be made as of now. For the real data example, which has a large sample size, CGROUP provides inflated estimates of the posterior SDs for examinees with large SDs, while MCEM-GROUP does not. However, MCEM-GROUP results in a few outliers in the middle of the range of the SDs. Interestingly, the pattern is quite different for the simulated data, where the MCEM-GROUP estimates of the posterior SDs are bigger than those from CGROUP. Overall, the stochastic EM method performs slightly better than the CGROUP version of MGROUP.

One problem with the stochastic EM method is that it is very time-consuming. For example, for our real data example (the 2003 reading data), BGROUP takes 8 hours and CGROUP takes 4 hours on a Pentium IV computer with 2.2 GHz, whereas MCEM-GROUP takes about 48 hours, longer than CGROUP by a factor of 12. However, with the advent of increasingly fast computers, application of MCEM-GROUP will become more feasible; at the least, the method can be used to obtain a second opinion right now.

It is possible to improve the efficiency of the importance sampling EM method even further. Consider the estimation of E(θ|X, Y, Γ_t, Σ_t), which is given in (15). Let θ_0 denote the mode of p(θ|X, Y, Γ_t, Σ_t). We have

E(θ|X, Y, Γ_t, Σ_t) = ∫ θ p(θ|X, Y, Γ_t, Σ_t) dθ
  ∝ ∫ θ exp{u(θ)} dθ,  for exp{u(θ)} ∝ p(θ|X, Y, Γ_t, Σ_t)
  = ∫ θ exp{u(θ_0) + ½(θ − θ_0)′u″(θ_0)(θ − θ_0) + Δ(θ)} dθ,  as u′(θ_0) = 0
  = e^{u(θ_0)} ∫ θ exp{½(θ − θ_0)′u″(θ_0)(θ − θ_0)} exp(Δ(θ)) dθ,

where Δ(θ) = u(θ) − u(θ_0) − ½(θ − θ_0)′u″(θ_0)(θ − θ_0). The first term under the exponential in the last step above is the kernel of a normal distribution, and Δ(θ) is much less variable (and close to zero) over the range of θ than u(θ) is.
Therefore, using the above normal distribution as the importance sampling density
More informationA note on multiple imputation for general purpose estimation
A note on multiple imputation for general purpose estimation Shu Yang Jae Kwang Kim SSC meeting June 16, 2015 Shu Yang, Jae Kwang Kim Multiple Imputation June 16, 2015 1 / 32 Introduction Basic Setup Assume
More informationBAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA
BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci
More information7. Estimation and hypothesis testing. Objective. Recommended reading
7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing
More informationLecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis
Lecture 3 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationSTA 216, GLM, Lecture 16. October 29, 2007
STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural
More informationUnivariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation
Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation PRE 905: Multivariate Analysis Spring 2014 Lecture 4 Today s Class The building blocks: The basics of mathematical
More informationCenter for Advanced Studies in Measurement and Assessment. CASMA Research Report. Hierarchical Cognitive Diagnostic Analysis: Simulation Study
Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 38 Hierarchical Cognitive Diagnostic Analysis: Simulation Study Yu-Lan Su, Won-Chan Lee, & Kyong Mi Choi Dec 2013
More information(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis
Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals
More informationLikelihood-based inference with missing data under missing-at-random
Likelihood-based inference with missing data under missing-at-random Jae-kwang Kim Joint work with Shu Yang Department of Statistics, Iowa State University May 4, 014 Outline 1. Introduction. Parametric
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationRecent Advances in the analysis of missing data with non-ignorable missingness
Recent Advances in the analysis of missing data with non-ignorable missingness Jae-Kwang Kim Department of Statistics, Iowa State University July 4th, 2014 1 Introduction 2 Full likelihood-based ML estimation
More informationThe Metropolis-Hastings Algorithm. June 8, 2012
The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings
More informationBasic IRT Concepts, Models, and Assumptions
Basic IRT Concepts, Models, and Assumptions Lecture #2 ICPSR Item Response Theory Workshop Lecture #2: 1of 64 Lecture #2 Overview Background of IRT and how it differs from CFA Creating a scale An introduction
More informationBayesian Analysis of Latent Variable Models using Mplus
Bayesian Analysis of Latent Variable Models using Mplus Tihomir Asparouhov and Bengt Muthén Version 2 June 29, 2010 1 1 Introduction In this paper we describe some of the modeling possibilities that are
More informationCOS513 LECTURE 8 STATISTICAL CONCEPTS
COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions
More informationAn Introduction to Path Analysis
An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving
More informationEquating of Subscores and Weighted Averages Under the NEAT Design
Research Report ETS RR 11-01 Equating of Subscores and Weighted Averages Under the NEAT Design Sandip Sinharay Shelby Haberman January 2011 Equating of Subscores and Weighted Averages Under the NEAT Design
More informationVariational Bayesian Inference for Parametric and Non-Parametric Regression with Missing Predictor Data
for Parametric and Non-Parametric Regression with Missing Predictor Data August 23, 2010 Introduction Bayesian inference For parametric regression: long history (e.g. Box and Tiao, 1973; Gelman, Carlin,
More informationBayesian Mixture Modeling
University of California, Merced July 21, 2014 Mplus Users Meeting, Utrecht Organization of the Talk Organization s modeling estimation framework Motivating examples duce the basic LCA model Illustrated
More informationBayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference
1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE
More informationDAG models and Markov Chain Monte Carlo methods a short overview
DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex
More informationOn Markov chain Monte Carlo methods for tall data
On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational
More informationStat 451 Lecture Notes Monte Carlo Integration
Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:
More informationLINKING IN DEVELOPMENTAL SCALES. Michelle M. Langer. Chapel Hill 2006
LINKING IN DEVELOPMENTAL SCALES Michelle M. Langer A thesis submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Master
More informationSTAT 425: Introduction to Bayesian Analysis
STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte
More informationPart 1: Expectation Propagation
Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud
More informationMH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution
MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous
More informationStatistical Distribution Assumptions of General Linear Models
Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions
More informationStatistical and psychometric methods for measurement: G Theory, DIF, & Linking
Statistical and psychometric methods for measurement: G Theory, DIF, & Linking Andrew Ho, Harvard Graduate School of Education The World Bank, Psychometrics Mini Course 2 Washington, DC. June 27, 2018
More informationFractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling
Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Jae-Kwang Kim 1 Iowa State University June 26, 2013 1 Joint work with Shu Yang Introduction 1 Introduction
More informationMCMC algorithms for fitting Bayesian models
MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models
More information2 Bayesian Hierarchical Response Modeling
2 Bayesian Hierarchical Response Modeling In the first chapter, an introduction to Bayesian item response modeling was given. The Bayesian methodology requires careful specification of priors since item
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationBiostat 2065 Analysis of Incomplete Data
Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies
More informationInvestigation into the use of confidence indicators with calibration
WORKSHOP ON FRONTIERS IN BENCHMARKING TECHNIQUES AND THEIR APPLICATION TO OFFICIAL STATISTICS 7 8 APRIL 2005 Investigation into the use of confidence indicators with calibration Gerard Keogh and Dave Jennings
More information