AMERICAN INSTITUTES FOR RESEARCH

LINKING RASCH SCALES ACROSS GRADES IN CLUSTERED SAMPLES

Jon Cohen, Mary Seburn, Tamas Antal, and Matthew Gushta
American Institutes for Research
May 23,
Thomas Jefferson Street, NW, Washington, DC

1. Introduction

Researchers and practitioners share an interest in measuring student learning over time. A single metric for student proficiency across grades would provide the yardstick by which student learning could be monitored and teacher, school, and program effectiveness could be measured. Interest in, and controversy around, such vertical scales has endured for decades (Cronbach & Furby, 1970; Linn & Slinde, 1977; Rogosa, Brandt, & Zimowski, 1982). Much of the controversy arises from the fact that curricula differ across grades, so the nature of the construct being measured by a vertical scale may vary along the dimension. In practice, vertical scales suffer from instability: it is common to find that different methods result in different inferences about growth over time (Harris, 1991; Kolen, 1981; Skaggs & Lissitz, 1988). This study is designed to compare two vertical linking methods in terms of the accuracy and precision of the estimators and the availability of adequate standard error estimators for realistic data.

Most often, vertical linking is accomplished through a common-item, nonequivalent groups design (Muraki, Hombo, & Lee, 2000; Kolen & Brennan, 2004). Under these designs, tests for adjacent grades include a set of common items that allows the scales to be linked across grades. With these data in hand, psychometricians often use one of two methods based on Item Response Theory (IRT; Rasch, 1960; Lord, 1980; Hambleton & Swaminathan, 1985) to link the scales across grades:

- Joint, or concurrent, calibration, in which the items from all of the overlapping forms of the test are calibrated together, resulting in a single cross-grade scale;
- Chain linking, in which each grade is calibrated separately, and the resulting item parameters of the items shared across grades are fixed to common values, or estimates of them are placed on a common scale.

The literature on the choice between these two general approaches is inconclusive at best.
Some studies have found that separate estimation improves the fit between the data and the model (e.g., Karkee, Lewis, Hoskens, & Yao, 2003; Kim & Cohen, 1998). Separate estimation, however, doubles the number of parameters estimated, reducing the degrees of freedom and necessarily improving the fit to the particular dataset. These studies did not evaluate whether the improved fit exceeded the improvement that would arise naturally from the reduction in the degrees of freedom. Petersen, Cook, and Stocking (1983) report that concurrent calibration produced more stable estimates, as did Hanson and Béguin (2002) under equivalent group designs. However, under nonequivalent group designs (as found in vertical linking studies), Hanson and Béguin report better estimates from separate estimation. They also find that different software produces different results and that the estimates are sensitive to ancillary specifications, such as prior distributions imposed to constrain parameter estimates.

Kolen and Brennan (2004) suggest that when the IRT model holds, concurrent calibration should produce more stable results because it uses all the available information to estimate the parameters. However, they prefer separate estimation when assumptions are violated. Taken together, this research suggests that separate calibration may provide a better fit to the data, but may do so by matching particular data sets at the cost of reduced generalizability. Here, we compare the methods when the data are multidimensional and arise from a complex sample.

2. A Model of Item Response

This study begins by formalizing the processes presumed to generate student responses to items in the real world. Item response theory provides a convenient foundation on which to build this model. We begin with a simple exposition of item response models and then extend that model:

- to reflect the fact that students are organized into somewhat homogeneous schools and classrooms;
- to capture the impact of changing curricula over time.

A Basic Model of Item Response

Classical measurement theory begins with a linear model

$$y_{ij} = \theta_i + e_{ij},$$

where $y_{ij}$ is person $i$'s response to a continuous item or test $j$, and $e_{ij}$ is the individual, item-specific measurement error of the response. If we have a binary measure in place of $y_{ij}$, we observe

$$z_{ij} = \begin{cases} 1 & \text{if } y_{ij} > b_j \\ 0 & \text{otherwise,} \end{cases}$$

where $b_j$ is a threshold along the true-score dimension. With this, the relationship between ability and item response can be stated as a probability

$$p(z_{ij} = 1 \mid \theta_i) = p(\theta_i + e_{ij} > b_j) = p(e_{ij} < \theta_i - b_j), \qquad (1)$$

where the last step uses the symmetry of the distribution of $e_{ij}$ about zero. This relationship forms the basis of most item response theory (IRT) models. For example, if $e$ is distributed standard logistic, we have the familiar Rasch model

$$p(z_{ij} = 1 \mid \theta_i) = \frac{1}{1 + e^{-(\theta_i - b_j)}}. \qquad (2)$$

This development explicitly recognizes item response as a function of the error term $e_{ij}$.

Extension to Clustered Data

In the real world, students are organized into schools, and the average proficiency of students varies across schools and classrooms. In addition, the instruction that students receive also varies by school or classroom. Both of these forces can have an impact on item response. Consider a high-achieving school: the average proficiency ($\bar{\theta}$) will be relatively high, resulting in relatively high probabilities of correct responses on the items. More subtly, consider a fourth-grade class in which the teacher enjoys teaching the multiplication of fractions, so she teaches it early and often. Her students will likely perform well on this type of item relative to other mathematics items. Therefore, we should expect to observe different patterns of performance from other teachers and other schools.

The first process (the clustering of students of similar ability within schools or classrooms) suggests that the structure of $\theta$ is clustered. For example,

$$\theta_{ik} = w_k + w_{ik},$$

where $i$ indexes examinees and $k$ indexes the schools or classrooms that make up the clusters.
The second process (differences in classroom curriculum, instruction, or timing) may show up as a clustering of the measurement error,

$$e_{ijk} = u_{jk} + u_{ijk}. \qquad (3)$$
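The threshold formulation behind Equations 1 and 2 can be checked numerically. The following is a minimal sketch, assuming a standard logistic error drawn by inverse-CDF sampling; the function names, the seed, and the values of `theta` and `b` are illustrative and do not come from the study.

```python
import math
import random


def rasch_prob(theta, b):
    """P(z = 1 | theta) under the Rasch model (Equation 2)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def draw_logistic(rng):
    """One draw from the standard logistic distribution via its inverse CDF."""
    u = rng.random()
    while u == 0.0:  # the logit is undefined at 0, so redraw
        u = rng.random()
    return math.log(u / (1.0 - u))


def simulate_response(theta, b, rng):
    """Threshold formulation: z = 1 when y = theta + e exceeds b."""
    return 1 if theta + draw_logistic(rng) > b else 0


# Hypothetical values, chosen only for illustration.
rng = random.Random(0)
theta, b = 0.5, -0.25
n_draws = 20000
freq = sum(simulate_response(theta, b, rng) for _ in range(n_draws)) / n_draws
print(round(rasch_prob(theta, b), 3), round(freq, 3))
```

The simulated proportion of correct responses should sit close to the closed-form Rasch probability, which is the sense in which the latent-threshold model and Equation 2 describe the same process.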

Extension to Vertical Scales

Vertically linked scales generally rest on the assumption that task demands in subsequent grades are simply harder versions of the earlier grades' tasks. Although curricula do tend to be vertically articulated, new skills and knowledge are introduced and taught at later grade levels, whereas other skills are mastered and no longer taught or tested. Test blueprints typically reflect these shifts, with the resulting vertical scales measuring a slightly different trait at each grade level. It is reasonable to consider the vertical scale to be the aspect of the curriculum that changes only in difficulty across grades. This trait will be correlated with the within-grade scales, but it is not identical to them. Denoting the within-grade trait $\theta$ and the vertical trait $\psi$,

$$\psi_{ik} = a_g + v_k + v_{ik}, \qquad \theta_{ik} = \beta \psi_{ik} + w_k + w_{ik}. \qquad (4)$$

The mean of the vertical trait increases with grade (subscripted $g$). A student's proficiency on the vertical trait at a point in time would reflect both school (or classroom) and student effects. The grade-specific trait reflects its correlation with the vertical trait, as well as additional school-specific effects.

Implications for Vertical Linking

This model of item response suggests several factors that will likely influence the stability of linkages across grades, given a fixed number of linking items and a fixed sample size:

- The correlation between the grade-specific and vertical traits. The unidimensional linking model is misspecified when two different traits contribute to generating the responses to the items. If these traits are perfectly correlated, they function as a single trait, eliminating this source of error. As this correlation decreases, the resulting error increases.
- The magnitude of the intra-cluster correlation found in $\theta$ and $\psi$. When units within a cluster are more similar to each other than they are to units in other clusters, they provide less information than an equivalent number of units from a random sample.
The impact that sampling design has on the precision of estimates is described by the design effect, which is often summarized as a ratio of the actual sampling variance to the sampling variance that would result from a simple random sample of the same size (i.e., the ratio of the actual standard error squared to the standard error squared from a simple random sample; Kish, 1965).
- The magnitude of the intra-cluster correlation in item-specific responses. As above, positive intra-cluster correlation will increase the design effect.
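The design effect described above is a simple ratio, and its dependence on intra-cluster correlation can be sketched with the widely used approximation $1 + (m - 1)\rho$ for clusters of equal size $m$. That approximation is a textbook result and is not stated in this report; the cluster size and correlation below are hypothetical.

```python
def design_effect(var_actual, var_srs):
    """Ratio of the actual sampling variance to the simple-random-sample variance."""
    return var_actual / var_srs


def approx_design_effect(cluster_size, icc):
    """Common approximation for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n, deff):
    """Number of simple-random-sample units carrying the same information."""
    return n / deff


# Hypothetical illustration: 10,000 students in clusters of 80, rho = 0.05.
deff = approx_design_effect(80, 0.05)
print(deff, effective_sample_size(10000, deff))
```

Even a modest intra-cluster correlation shrinks the effective sample size considerably when clusters are large, which is why clustered samples yield less precise linking constants than their nominal size suggests.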

In general, these forces constitute violations of the assumptions underlying the basic IRT models used to calibrate and link most tests. When faced with such violations, statisticians typically take one of two strategies: develop more complex models that more accurately model the processes of interest, or develop methods that are robust to such violations. A number of researchers have developed structural IRT models that capture more of the real-world complexities (e.g., Patz & Junker, 1999; Kamata, 1998, 2001; Skrondal & Rabe-Hesketh, 2003; Rabe-Hesketh, Skrondal, & Pickles, 2004; Glas, 2005). These models often require additional distributional assumptions, possibly reducing their robustness. Here, we take an alternative approach common in the sampling literature, using the more familiar vertical-linking point estimators and constructing robust confidence intervals around them (e.g., Cochran, 1977; Kish, 1965; Särndal, Swensson, & Wretman, 1992).

3. Two Models for Creating Vertical Scales and Evaluating Their Precision

This section describes the technical details of the joint calibration and chain-linking procedures and the proposed standard error estimators for each. The calibration procedures can be successfully implemented in a variety of ways: through conditional maximum likelihood (CML, as in ConQuest or OPLM), marginal maximum likelihood (MML) or nonparametric marginal maximum likelihood (NPMML, as in Bilog, Parscale, or Logimo), or, with a sufficient number of items, joint maximum likelihood (JML, as in Winsteps). This study uses NPMML procedures, which have been shown to share the optimal properties of CML estimators (De Leeuw & Verhelst, 1986) and for which robust variance estimators are available (Cohen, Chan, Jiang, & Seburn, 2005).

The Joint Calibration Procedure

The joint calibration procedure develops the common vertical scale in a single step by calibrating all items from all grades simultaneously.
Given $G$ grades to link, the log-likelihood has the following form:

$$\log L = \sum_{g=3}^{G} l_g,$$

where $l_g$ is the grade-specific (marginal) log-likelihood given by

$$l_g = \sum_{i=1}^{N_g} \log \int L(z_i \mid \theta, \beta)\, f_g(\theta)\, d\theta, \qquad (5)$$

where $N_g$ is the number of students in grade $g$, $\theta$ is the ability, $f_g$ is its grade-dependent density function, $\beta = (\beta_1, \ldots, \beta_J)$ is the collection of item parameters, and $z_i = (z_{i1}, \ldots, z_{iJ})$ is the row of the full response matrix $Z$ corresponding to student $i$ in grade $g$. Using the independence assumption, we see that the likelihood of row $i$ is

$$L(z_i \mid \theta, \beta) = \prod_{j=1}^{J} p(z_{ij} \mid \theta, \beta_j).$$

(Note that $J$, the number of items, may also depend on grade. However, to avoid cumbersome formulae, we suppress any notation indicating this dependence.)

Note that Equation 5 contains an explicit model of the population distribution within a grade. This decomposition of the likelihood and the population distribution provides the framework for vertical linking. The connections across grades are made by the sets of common items assigned to students in adjacent grades.

The NPMML proceeds by replacing $f_g$ with an empirical vector of normalized weights $p_g = (p_{g1}, \ldots, p_{gQ})$ on a prespecified collection of $Q$ population parameters (quadrature points) $\theta = (\theta_1, \ldots, \theta_Q)$, resulting in the following approximation of the grade-specific log-likelihood:

$$l_g \approx \sum_{i=1}^{N_g} \log \sum_{q=1}^{Q} L(z_i \mid \theta_q, \beta)\, p_{gq},$$

with the constraint that

$$\sum_{q=1}^{Q} p_{gq} = 1 \quad \text{for all } g. \qquad (6)$$

In order to identify the model, it is sufficient to fix the mean proficiency within a single grade. For simplicity, let us fix the lowest grade proficiency to 0; that is, let

$$\sum_{q=1}^{Q} \theta_q\, p_{3q} = 0. \qquad (7)$$

If we assume that the locations of the quadrature points are fixed, rather than estimated, the task becomes finding the conditional maximum of

$$l(Z, \beta, p_3, \ldots, p_G) = \sum_{g=3}^{G} l_g$$

subject to the constraints in Equations (6) and (7). Because we want the location of the conditional maximum, we use Lagrange multipliers to redefine the likelihood function $l$ to include the constraints with the new estimable parameters:

$$\tilde{l}(Z, \beta, p_3, \ldots, p_G, \mu, \lambda_3, \ldots, \lambda_G) = \sum_{g=3}^{G} \tilde{l}_g + \mu \sum_{q=1}^{Q} \theta_q\, p_{3q},$$

where

$$\tilde{l}_g = l_g + \lambda_g \left( \sum_{q=1}^{Q} p_{gq} - 1 \right).$$

We use an extension of Bock and Aitkin's (1981) EM algorithm to implement the NPMML estimation (see Cohen et al., 2005). This calibration yields grade-specific population distributions. From these we can readily obtain an estimate of the population moments; for example, the first moment (in the grades in which it is not fixed) is given by

$$\mu_g = \sum_{q=1}^{Q} \theta_q\, p_{gq}.$$

Standard Error of Parameters

In a simple random sample, the likelihood of the data is the product of the individual likelihoods across observations, estimated by taking logs and summing across observations. When the observations are correlated, as in a clustered sample, the function is no longer a true likelihood function: the joint likelihood of the observations is no longer the product of the likelihoods of each observation, because that product neglects the covariance among observations. Psychometricians continue to use estimates based on this likelihood function, even though it does not accurately model the real-world process of interest. The score function constitutes an estimating equation in the sense of Godambe (1960) and Godambe and Thompson (1984), and the parameters of that function continue to hold pragmatic interest in operational testing programs. The inverse of the would-be information matrix, however, no longer provides an acceptable approximation of the variance of those estimates (Binder, 1983; Godambe & Thompson, 1984). For that reason, we use a Taylor-series approximation of the standard error, based upon the work of Binder (1983).

To develop the approximate variance estimator, we begin by reparameterizing the likelihood function. There are, in general, two equivalent approaches to estimating constrained maximum likelihood models. The first, which we mention above, is based on the constrained likelihood (by introducing Lagrange multipliers). The second, based on a reduced likelihood function, is obtained by eliminating redundant parameters (Mislevy, 1984).
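The quadrature form of the grade-specific log-likelihood and the first-moment formula can be sketched in a few lines. This is only an evaluation of the objective for given parameters, not the EM estimation itself; the function names, the use of `None` for items a student did not take, and the toy inputs are assumptions made for illustration.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def row_likelihood(z_row, theta, betas):
    """Likelihood of one response row at a fixed ability, assuming local independence."""
    like = 1.0
    for z, b in zip(z_row, betas):
        if z is None:  # assumption: None marks an item not administered
            continue
        p = rasch_prob(theta, b)
        like *= p if z == 1 else 1.0 - p
    return like


def grade_loglik(Z, betas, nodes, weights):
    """Sum over students of log sum_q L(z_i | theta_q, beta) * p_q."""
    total = 0.0
    for z_row in Z:
        marginal = sum(row_likelihood(z_row, t, betas) * p
                       for t, p in zip(nodes, weights))
        total += math.log(marginal)
    return total


def first_moment(nodes, weights):
    """Population mean implied by the weights on the quadrature points."""
    return sum(t * p for t, p in zip(nodes, weights))


# Toy inputs: three quadrature points and two students on three items.
nodes, weights = [-1.0, 0.0, 1.0], [0.25, 0.5, 0.25]
Z = [[1, 0, None], [1, 1, 0]]
betas = [-0.5, 0.0, 0.5]
print(grade_loglik(Z, betas, nodes, weights), first_moment(nodes, weights))
```

Because the weights are free parameters, the grade mean moves with them; fixing `first_moment` to zero in one grade is what pins down the origin of the joint scale.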
Following Mislevy, we reparameterize to eliminate redundant parameters, using the information from the constraints to calculate the eliminated parameters in the full model. More precisely, we regard the last two population mass parameters $p_{Q-1}$ and $p_Q$ as functions of the previous $Q - 2$ (because there are two constraints):

$$p_{Q-1} = \frac{a\,\theta_Q - b}{\theta_Q - \theta_{Q-1}} \quad \text{and} \quad p_Q = \frac{a\,\theta_{Q-1} - b}{\theta_{Q-1} - \theta_Q},$$

where

$$a = 1 - \sum_{q=1}^{Q-2} p_q \quad \text{and} \quad b = -\sum_{q=1}^{Q-2} \theta_q\, p_q.$$

Let us define the weighted score function as the first derivative of the marginal log-likelihood with respect to the reduced set of parameters of the model, $\gamma = (\beta, p^{red}) = (\beta, (p_1, \ldots, p_{Q-3}, p_{Q-2}))$:

$$W(\gamma) = W(\beta, p^{red}) = D_\gamma\, l^{red} = D_\gamma \sum_{k=1}^{K} \sum_{i=1}^{n_k} w_k \log \sum_{q=1}^{Q} L(z_i \mid \theta_q, \beta)\, p_q,$$

where $w_k$ ($k = 1, \ldots, K$) is the sampling weight associated with cluster (or PSU, primary sampling unit) $k$, and $n_k$ is the size of cluster $k$ (again, for the sake of transparency we ignore stratification). In our context, the equation

$$W(\gamma) = 0, \qquad (8)$$

solved for $\gamma$, provides an estimating equation in the sense of Godambe and Thompson (1984) by which we may obtain consistent estimates of the finite population variances using the formulae of Binder (1983). To see this, let us assume that $\hat{\gamma}$ is the solution of the estimating equation (8) in the sample and $\gamma$ is the solution based on the full finite population or the set of all possible populations of interest. Then, to first order, we have

$$0 = W(\hat{\gamma}) = W(\gamma) + \frac{\partial W(\gamma)}{\partial \gamma} (\hat{\gamma} - \gamma) + R.$$

From this we obtain

$$\hat{\gamma} - \gamma = -\left[ \frac{\partial W(\gamma)}{\partial \gamma} \right]^{-1} W(\gamma)$$

and

$$\mathrm{Var}(\hat{\gamma}) = (\hat{\gamma} - \gamma)(\hat{\gamma} - \gamma)^T = \left[ \frac{\partial W(\gamma)}{\partial \gamma} \right]^{-1} W(\gamma)\, W(\gamma)^T \left[ \frac{\partial W(\gamma)}{\partial \gamma} \right]^{-T}.$$
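The linearization above is easier to follow on a deliberately simple estimating equation. The sketch below applies it to a weighted mean, $W(\gamma) = \sum_k w_k \sum_i (y_{ki} - \gamma)$, rather than to the Rasch score function. It assumes the variance of $W$ is estimated from the between-cluster variability of the cluster score totals; the function name and the toy data are hypothetical.

```python
def linearized_mean_variance(clusters, weights):
    """Taylor-linearization variance of a weighted mean under cluster sampling.

    clusters: list of lists of observations, one inner list per cluster (PSU).
    weights:  one sampling weight per cluster.
    """
    total_w = sum(w * len(c) for c, w in zip(clusters, weights))
    gamma_hat = sum(w * sum(c) for c, w in zip(clusters, weights)) / total_w

    # Cluster-level score totals W_k evaluated at the solution gamma_hat.
    scores = [w * sum(y - gamma_hat for y in c)
              for c, w in zip(clusters, weights)]
    K = len(clusters)
    mean_score = sum(scores) / K

    # Between-cluster estimate of the variance of the total score.
    omega = K / (K - 1) * sum((s - mean_score) ** 2 for s in scores)

    # Derivative of W with respect to gamma is minus the total weight.
    d_w = -total_w
    return gamma_hat, omega / d_w ** 2


# Toy data: two clusters of two observations, equal weights.
gamma_hat, var_hat = linearized_mean_variance([[1.0, 1.0], [3.0, 3.0]], [1.0, 1.0])
print(gamma_hat, var_hat)
```

In this toy case the students within each cluster are identical, so all of the sampling variance comes from differences between clusters, which a simple-random-sample formula treating the four observations as independent would understate.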

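Introducing $\Omega(\gamma)$ as the variance of $W(\gamma)$ across observations and taking the expectation, we obtain the covariance matrix of the reduced set of parameters:

$$\Sigma^{\gamma}_{red} = \mathrm{Var}(\hat{\gamma}) = \left[ \frac{\partial W(\gamma)}{\partial \gamma} \right]^{-1} \Omega(\gamma) \left[ \frac{\partial W(\gamma)}{\partial \gamma} \right]^{-T} \Bigg|_{\gamma = \hat{\gamma}}.$$

To obtain an estimate $\hat{\Omega}(\hat{\gamma})$ of $\Omega(\gamma)$, we use the stratified, between-PSU weighted estimator, which is given by

$$\hat{\Omega}(\hat{\gamma}) = \frac{K}{K-1} \sum_{k=1}^{K} (W_k - \bar{W})(W_k - \bar{W})^T,$$

where

$$W_k = D_\gamma \sum_{i=1}^{n_k} w_k \log \sum_{q=1}^{Q} L(z_i \mid \theta_q, \beta)\, p_q \,\Bigg|_{\gamma = \hat{\gamma}} \quad \text{and} \quad \bar{W} = \frac{1}{K} \sum_{k=1}^{K} W_k.$$

Standard Error of Moments

When creating a vertical scale, we are interested in the estimates of the first moment of the population distribution (fixing this moment in one of the grades to zero). The previous section yields the reduced covariance matrix $\Sigma^{p}_{red}$ of the population mass parameters as a submatrix of $\Sigma^{\gamma}_{red}$:

$$\Sigma^{\gamma}_{red} = \begin{pmatrix} \Sigma^{\beta} & \Sigma_{12} \\ \Sigma_{21} & \Sigma^{p}_{red} \end{pmatrix}.$$

To obtain the covariance matrix $\Sigma^{p}$ of the full set of population mass parameters, we first compute the covariance matrix $\Sigma^{p}_{ab}$ of $(p_1, \ldots, p_{Q-2}, a, b)$ via $\Sigma^{p}_{ab} = D_{ab}\, \Sigma^{p}_{red}\, D_{ab}^T$, where

$$D_{ab} = \begin{pmatrix} 1 & & \\ & \ddots & \\ & & 1 \\ -1 & \cdots & -1 \\ -\theta_1 & \cdots & -\theta_{Q-2} \end{pmatrix}.$$

Then, $\Sigma^{p} = D\, \Sigma^{p}_{ab}\, D^T$ with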

$$D = \begin{pmatrix} I_{Q-2} & 0 \\ 0 & D_2 \end{pmatrix}, \qquad D_2 = \begin{pmatrix} \dfrac{\theta_Q}{\theta_Q - \theta_{Q-1}} & \dfrac{-1}{\theta_Q - \theta_{Q-1}} \\[2ex] \dfrac{-\theta_{Q-1}}{\theta_Q - \theta_{Q-1}} & \dfrac{1}{\theta_Q - \theta_{Q-1}} \end{pmatrix},$$

where $I_{Q-2}$ is the $(Q-2)$-dimensional identity matrix. Note that

$$D_2 \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} p_{Q-1} \\ p_Q \end{pmatrix}.$$

Finally, the moment covariance matrix $\Sigma^{M} = M\, \Sigma^{p}\, M^T$ for any grade is calculated. Here,

$$M = \begin{pmatrix} \theta_1 & \theta_2 & \cdots & \theta_Q \\ \theta_1^2 & \theta_2^2 & \cdots & \theta_Q^2 \\ \vdots & \vdots & & \vdots \end{pmatrix}.$$

We note that this approach neglects the covariance among moment estimates across grades. The impact of this simplification remains an open, empirical question.

The Separate Calibration, or Chain-Linking, Procedure

Unlike concurrent calibration, separate calibration estimates the parameters for each grade separately and then links them through the use of linking items common to multiple grades. One of the grades becomes the base scale to which subsequent tests are linked; here, we use the lowest grade (say, grade 3). Using the common items between the grade 3 base scale and the next grade (grade 4), we determine a transformation that puts the item parameters from grade 4 on the same scale as grade 3. This process of chain-linking repeats until all grades are scaled to the grade 3 base scale. Vertical linking, when performed via separate calibration, is a localized operation, consisting of pair-wise linkages of consecutive grades that establish the vertical scale.

Vertical Linking Baseline: Grade 3. With grade 3 as our base, the vertically linked scale score $\theta^L_{3i}$ for the $i$th student in grade 3 coincides with the $i$th student's within-grade scaled proficiency $\theta_{3i}$; that is, $\theta^L_{3i} = \theta_{3i}$.

Vertical Linking of Grade 3 to Grade 4. In grade 4, $\theta_{4i}$ denotes the achievement of the $i$th student from the within-grade scaling of the grade 4 items. We link grade 4 to grade 3 through a set of linking items, items that are common to both the grade 3 and grade 4 tests. If there are $m_{34}$ of these items, $b_{(3)4}$ is the vector of Rasch difficulty estimates for these items when they are scaled within the third-grade data and $b_{3(4)}$ is the vector of difficulty estimates for the same

items when the fourth-grade data are calibrated. Let $\bar{b}_{(3)4}$ and $\bar{b}_{3(4)}$ be the means of these parameter estimates. Then, standard Rasch practice links the grade 4 achievement scale to the grade 3 achievement scale via

$$\theta^L_{4i} = \theta_{4i} + B_{34},$$

where $B_{34} = \bar{b}_{(3)4} - \bar{b}_{3(4)}$ is the linking constant.

Because the linking constant is estimated from both a sample of students and a sample of test items (those selected for linking), it is subject to sampling error from both sources. The error from the sampling of items arises because linking items are not entirely exchangeable in the linking process: a different sample of items would yield a different linkage. Under the assumption that the sampling error is independent across items, the variance of the vector $b_{(3)4} - b_{3(4)}$ of length $m_{34}$ should reflect both sources of error. This assumption, of course, is unlikely to be true, and the consequences of its violation remain an open, empirical question. The linking constant is the average over the vector $b_{(3)4} - b_{3(4)}$, so we propose to approximate the standard error of the linking constant by the standard error of this mean,

$$\mathrm{Var}(B_{34}) = \frac{1}{m_{34}(m_{34} - 1)} \sum_{j=1}^{m_{34}} \left( b_{(3)4,j} - b_{3(4),j} - B_{34} \right)^2, \qquad SE(B_{34}) = \sqrt{\mathrm{Var}(B_{34})}.$$

The variance of the mean of grade 4 students is (with $\mu_4 = \bar{\theta}_4$ and $\mu^L_4 = \bar{\theta}^L_4$):

$$\mathrm{Var}(\mu^L_4) = \mathrm{Var}(\mu_4) + \mathrm{Var}(B_{34}) = \mathrm{Var}(B_{34}).$$

The latter holds because the population means in separate calibration are fixed to zero; they are not estimated. Comparisons between grades 3 and 4 on the vertical scale must contain the variance component $\mathrm{Var}(B_{34})$. That is,

$$\mathrm{Var}(\mu^L_4 - \mu^L_3) = \mathrm{Var}(B_{34}).$$
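The linking constant and its proposed standard error amount to the mean of item-wise difficulty differences and the standard error of that mean. The sketch below assumes the difference is taken as base-grade estimates minus higher-grade estimates, so that growth between grades yields a positive constant; the function name and the six difficulty values per grade are hypothetical.

```python
import math


def linking_constant(b_base, b_next):
    """Mean difficulty difference on the linking items and its standard error.

    b_base: linking-item difficulties from the base-grade calibration.
    b_next: difficulties of the same items from the next grade's calibration.
    """
    m = len(b_base)
    diffs = [x - y for x, y in zip(b_base, b_next)]
    B = sum(diffs) / m
    var_B = sum((d - B) ** 2 for d in diffs) / (m * (m - 1))
    return B, math.sqrt(var_B)


# Hypothetical difficulty estimates for six linking items.
b_grade3 = [0.9, 1.1, 1.4, 0.6, 1.0, 1.0]
b_grade4 = [0.4, 0.6, 1.0, 0.1, 0.4, 0.5]
B34, se_B34 = linking_constant(b_grade3, b_grade4)

# Place a grade 4 within-grade proficiency on the grade 3 base scale.
theta_4 = 0.2
theta_4_linked = theta_4 + B34
print(round(B34, 3), round(se_B34, 4), round(theta_4_linked, 3))
```

With only six linking items, a single item whose difficulty shifts differently from the rest moves both the constant and its standard error noticeably, which is the item-sampling error the text describes.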

Vertically Linked Scale. Applying the formulae of the previous section to all pairs of adjacent grades creates the vertically linked trait scale that includes all grades. This results in a series of linking constants $(B_{34}, B_{45}, \ldots, B_{G-1,G})$ with corresponding variances $(\mathrm{Var}(B_{34}), \mathrm{Var}(B_{45}), \ldots, \mathrm{Var}(B_{G-1,G}))$. As with concurrent calibration, when analyzing the ability shifts $\mu^L_{g'} - \mu^L_g$ between two arbitrary grades $g < g'$, we include the following variance component:

$$\mathrm{Var}(\mu^L_{g'} - \mu^L_g) = \mathrm{Var}\!\left( \sum_{g \le h < g'} B_{h,h+1} \right).$$

Again, these formulae are approximate because they treat the sampling variance of the means and linking constants as though they are independent across grades.

4. Simulation Study

This section describes a simulation study designed to compare the accuracy and precision of each of the linking methods under realistic data conditions and to evaluate the efficacy of the proposed standard error approximations.

We base the study on the vertical linking sample design used to link the Ohio Achievement Tests from grades 3–8. In general, this design includes approximately 25 schools and 10,000 students per grade and six linking items shared between each pair of adjacent grades (design changes implemented after this study was completed increased the actual number of linking items in adjacent grades).

Realistic values of some of the parameters of the model of item response set forth in Section 2 (above) were simply not known. To obtain realistic values, we generated data sets from 54 different data configurations, and we selected the configuration that most closely approximated the design effects observed in real item responses from a similar (within-grade) sample design. For ease of exposition, we refer to the configuration that yielded the most realistic within-grade design effects as the most Ohio-like configuration of parameters.
Using the most Ohio-like configuration, we generated 100 datasets and applied both linking procedures to evaluate

- the ability of the procedures to recover the generating parameters;
- the ability of the procedures to accurately approximate the precision of the estimates;
- the precision of the estimates from each procedure.

The within-grade data that we used to match the parameter configurations cannot, of course, inform our choice of values for the correlations between the within-grade and vertical scale. Therefore, we generate additional datasets, holding all parameters constant except the correlation, which we vary from .70 to .98 to observe the impact of this factor on the accuracy and precision of estimates.

Data Generation Details

Similar to the Ohio design, our data also span grades 3 through 8. A single linking form per grade consists of 33 core items (39 for grades 3 and 8) and 12 linking items per form (except for grades 3 and 8, which had only 6 linking items). We clustered our data within 25 elementary schools (grades 3–5) and 25 middle schools (grades 6–8). As is the case in Ohio, not all schools contributed data for all grades. We generated 40 total schools, but only 25 schools contributed scores for any one grade (see Table 1). Our data consisted of approximately 10,000 students per grade, for a total of 60,000 observations in each data set.

Table 1: Data structure of simulated data sets: number of schools for each grade, 60,000 observations total.
[Table body: a grid of X marks showing which groups of schools contributed data to each grade, with the total number of schools per grade; the cell values are not recoverable.]

Generation of Item Responses

The data were generated according to the model outlined in Section 2, A Model of Item Response. First, we generated the vector of latent traits $\theta$ and $\psi$ as specified in Equation 4. For convenience, we scaled the stochastic terms to yield traits with unit variance. Next, we generated the stochastic components of the item response function as in Equation 3, taking care to scale these components to yield $e_{ijk}$ with a standard deviation of approximately 1.7, to match the standard logistic curve of the Rasch model. The final step calculated the item responses. Table 2 summarizes the key parameters of those models.
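The generation steps just described can be sketched for a single grade as follows. This is not the authors' generator: it assumes normally distributed school and student effects, a logistic item-level residual rescaled so that the total error standard deviation is 1.7, and residual variances chosen so that both traits have unit variance. All names and the parameter values in the usage lines are hypothetical.

```python
import math
import random


def logistic_draw(rng, scale):
    """Draw from a logistic distribution with the given scale via its inverse CDF."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return scale * math.log(u / (1.0 - u))


def simulate_grade(item_b, n_schools, n_per_school, a_g, beta,
                   var_v_k, var_w_k, var_u_jk, rng, sd_e=1.7):
    """Simulate binary responses for one grade under Equations 3 and 4."""
    # Residual variances chosen so that psi and theta each have unit variance.
    var_v_ik = 1.0 - var_v_k
    var_w_ik = 1.0 - beta ** 2 - var_w_k
    var_u_ijk = sd_e ** 2 - var_u_jk
    if min(var_v_ik, var_w_ik, var_u_ijk) < 0.0:
        raise ValueError("variance components exceed the target totals")
    # A logistic variable with scale s has variance (s * pi)^2 / 3.
    scale_u = math.sqrt(3.0 * var_u_ijk) / math.pi

    rows = []
    for k in range(n_schools):
        v_k = rng.gauss(0.0, math.sqrt(var_v_k))   # school effect, vertical trait
        w_k = rng.gauss(0.0, math.sqrt(var_w_k))   # school effect, grade trait
        u_jk = [rng.gauss(0.0, math.sqrt(var_u_jk)) for _ in item_b]
        for _ in range(n_per_school):
            psi = a_g + v_k + rng.gauss(0.0, math.sqrt(var_v_ik))
            theta = beta * psi + w_k + rng.gauss(0.0, math.sqrt(var_w_ik))
            z = [1 if theta + u + logistic_draw(rng, scale_u) > b else 0
                 for u, b in zip(u_jk, item_b)]
            rows.append((k, z))
    return rows


# Small hypothetical run: 4 schools, 10 students each, 5 items of difficulty 0.
rows = simulate_grade([0.0] * 5, 4, 10, a_g=0.5, beta=0.98,
                      var_v_k=0.1, var_w_k=0.01, var_u_jk=0.25,
                      rng=random.Random(1))
print(len(rows), rows[0])
```

Drawing `u_jk` once per school and item is what induces the item-specific clustering of Equation 3: every student in a school shares the same shift on a given item.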

Table 2: Summary of key factors likely to influence vertical linkage.

Parameters of the latent traits
- The linear relationship ($\beta$) between $\theta$, the grade-specific trait, and $\psi$, the vertical trait. In our datasets, this coefficient is also the correlation coefficient.
- Annual growth, $a$. This is the average growth on the vertical scale in a year. For our study, we simply take this as a constant.
- Variance of school effects on the vertical trait ($\mathrm{var}(v_k) = \sigma^2_{v(k)}$). This is the school effect on the vertical trait.
- Variance of school effects on the grade-specific trait ($\mathrm{var}(w_k) = \sigma^2_{w(k)}$). This grade-specific school effect compounds the school effect associated with the latent trait. Because $w_k$ compounds $v_k$, it is assumed to be small.

Stochastic parameters of item response
- Variance of the item-specific school effects ($\mathrm{var}(u_{jk}) = \sigma^2_{u(jk)}$). This item-specific school effect compounds the impact of school effects on the latent traits. Depending on curricular differences, it could be substantial. Recall that this is part of a stochastic term with a standard deviation of 1.7, so the largest value represents about 22 percent of the variance.

The columns "Likely Range of Realistic Values" and "Values Used in Simulated Data Sets" are only partly recoverable; the surviving entries are .90, .0025, .2500, and .6400.

The final column of the table presents the candidate values for each parameter in the simulations. From that, we see that we have 3 * 3 * 3 * 2 = 54 possible combinations. To select the most realistic values, we calculated the average design effect on estimates of the percentage of correct responses to each item from a real data set (grade 3 reading data, drawn from a similar sample design) and compared that to the observed design effects in the simulated data sets. From these we identified the configuration that most closely matched the real data. The details of these configurations are presented in Appendix A. The parameters of the most Ohio-like configuration are presented in Table 3.
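The "about 22 percent" figure follows from dividing the item-specific school-effect variance by the total error variance, $1.7^2$. The short sketch below reproduces that arithmetic; it assumes that .0025, .25, and .64 are the candidate variances for this parameter, which is an inference from the surviving table entries rather than something the table states outright.

```python
def share_of_error_variance(var_u_jk, sd_e=1.7):
    """Fraction of the total item-response error variance due to school-by-item effects."""
    return var_u_jk / sd_e ** 2


# Assumed candidate variances (see the lead-in for the caveat).
candidates = (0.0025, 0.25, 0.64)
shares = [share_of_error_variance(v) for v in candidates]
for v, s in zip(candidates, shares):
    print(v, round(100 * s, 1))
```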

Table 3: Parameters of the most Ohio-like configuration.

Parameter              Value
$\beta$                .98
$a$                    .50
$\sigma^2_{v(k)}$      .0
$\sigma^2_{w(k)}$
$\sigma^2_{u(jk)}$     .25

Recovery of Generating Parameters, Precision, and Accuracy of the Standard Errors

We created 100 datasets, using the most Ohio-like configuration, to evaluate the accuracy, precision, and effectiveness of the standard error estimators of the two linking methods. Table 4 compares the results of these simulations. The estimates of both methods reveal a small bias, which increases as the estimates cross additional grades. This finding seems intuitive because the trait measured is a compromise between the vertical trait and the somewhat attenuated grade-specific traits.

The final columns compare the empirical standard errors (the standard deviations of the estimates across 100 datasets). From these, we see that the joint calibration produces estimates that are slightly more efficient than the separate calibration procedure. Again, this is reassuring because the joint calibration brings more information to bear in estimating each item parameter. The standard error estimates from the joint calibration very closely match the empirical standard errors, but the proposed standard error estimator for the chain linking underestimates the standard errors by about 5–20 percent.

Table 4: Joint and separate linking constants with standard errors, over 100 replications.
[Columns: Grades; true linking constant $E(B)$; linking constant estimates from separate calibration ($B_{sep}$) and joint calibration ($B_j$); standard error of the estimate, $SE(B_{sep})$ and $SE(B_j)$; observed standard deviation of the estimate, $SD(B_{sep})$ and $SD(B_j)$. The cell values are not recoverable.]

In summary, the two procedures offer virtually identical point estimates when averaged over many data sets.
The joint calibration procedure is somewhat more efficient, and the proposed standard error estimator for the joint calibration procedure provides a more accurate approximation than the standard error estimator that we proposed for the chain-linking procedure.

Correlation Between Vertical and Grade-Specific Trait

Given that the two procedures provide very similar results, and that the standard error estimates are more accurate for the joint calibration procedure, we analyzed the effect of varying the correlation by using only the joint calibration procedure. Table 5 describes the effect of varying the correlation between the vertical and grade-specific trait for five of the data sets configured to the most Ohio-like specifications, with $\rho$ ranging from .70 to .98. As this correlation increases, the standard error decreases within each grade. When $\rho$ is small, the standard errors of the linking constants and the root mean square error (RMSE) are large. No clear relationship appears between $\rho$ and the observed bias in the estimated linking constants.

Table 5: Linking constant and standard error for different correlations between the vertical and grade-specific trait.
[Columns: Grades; $\rho$; $E(B)$; $B_j$; $SE(B_j)$; Bias; RMSE. The cell values are not recoverable.]
Note: Results shown in this table are from five data sets, each created to the same most-Ohio-like specification scheme with only the correlation between the grade-specific and vertical trait varying.
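The summary quantities reported in Tables 4 and 5 (bias, empirical standard deviation across replications, RMSE, and how well the reported standard errors track the empirical ones) can be computed as in the sketch below. The function name, the dictionary keys, and the three replicate values are hypothetical; the study's own tables are based on 100 replications.

```python
import math
import statistics


def summarize_replications(estimates, reported_ses, true_value):
    """Summarize linking-constant estimates across simulated replications."""
    mean_est = statistics.fmean(estimates)
    empirical_sd = statistics.stdev(estimates)
    rmse = math.sqrt(statistics.fmean([(e - true_value) ** 2 for e in estimates]))
    mean_se = statistics.fmean(reported_ses)
    return {
        "bias": mean_est - true_value,
        "empirical_sd": empirical_sd,
        "rmse": rmse,
        # A ratio below 1 means the estimator understates the true variability.
        "se_ratio": mean_se / empirical_sd,
    }


# Hypothetical replicate results for one pair of adjacent grades.
summary = summarize_replications([0.4, 0.5, 0.6], [0.1, 0.1, 0.1], 0.5)
print(summary)
```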

5. Conclusion

The main goal of the study was to document the performance of the two linking methods by using realistic data and to verify the estimator of the standard error of the vertical linking constant. This study has found that:

- The two methods produce nearly identical results, even when the vertical linking items and the main assessment items load on separate, correlated traits;
- Our proposed standard error estimator for the joint calibration procedure matches the empirical standard errors to within 3 percent under complex sample designs;
- The joint calibration procedure produces moderately more efficient estimates, with reduction in the standard errors of 4–10 percent;
- Our proposed standard error estimator for the separate estimation method underestimates the empirical standard errors by 10–15 percent;
- The precision of the estimates is affected by the correlation between the on-grade traits and the vertical trait, with lower correlations associated with much less precise estimates.

Other than the item misfit introduced by the clustering of error terms and multidimensionality, this study did not address the impact of item fit on the standard errors. In real data, we expect that item misfit will contribute to larger standard errors than those presented here.

Appendix A: Data Configurations and Most Ohio-Like Data Generation

The data simulated for this study were generated to closely resemble real test data from Ohio. We wanted to match the Ohio design as closely as possible, while simplifying the data structure to facilitate analysis by eliminating stratification and constructed-response items and by generating only a single test form per grade.

Our initial step in data generation was to select several potential realistic values that the parameters of interest could take (see Table 2 in the body of the report) and then to generate data sets representing every unique combination of these values. The resulting 54 data set configurations are described in Table A-1. To select from these 54 data sets the configuration that most closely matched real Ohio data, we computed the design effects for the simulated data, and we selected the data sets that most closely resembled the real design effects from the grade 3 reading test (DE = 3.645). As can be seen from Figure A-1, one group of data sets more closely resembles the Ohio root design effect than the others. This group corresponds to the data sets where we set the standard deviation of the item-specific school effects parameter equal to 0.5 and includes the data sets numbered (2, 5, 8,, 44, 47, 50, 53) in Table A-1. We call these the Ohio-like datasets. Also apparent from this figure is that changing the value of this parameter has the largest impact on the root design effect. Varying the other parameters has a much smaller impact.

Figure A-1: Grade 3 design effects for 54 data sets representing all unique combinations of parameters likely to influence linking error

[Figure not reproduced. Axis labels: Mean Root Design Effect; Dataset.]

NOTE: The solid line represents the root design effect observed in the Ohio operational grade 3 Reading data (Root mean design effect =.89)

From these 2 Ohio-like configurations, we selected one (data set #38) that was very close to the true Ohio design effect observed in the third-grade Reading test data. The specifications for the Most Ohio-like data set are provided in Table A-1 and Table 3 (in the body of the report). The simulations described in this report are based on 100 data sets generated to these specifications.

Table A-1: Configuration of original 54 data sets.

Columns: Data Set ID; Correlation between θ and θ; SD of school effects on vertical trait; SD of school effect on grade specific trait; SD of item specific school effects; Annual growth

NOTE: The Ohio-like data sets are bolded and the most Ohio-like data set, #38, is shaded.

References

Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51.

Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: John Wiley & Sons.

Cohen, J., Chan, T., Jiang, T., & Seburn, M. (2005). Consistent estimation of Rasch item parameters and their standard errors under complex sample designs. Manuscript submitted for publication.

Cronbach, L. J., & Furby, L. (1970). How we should measure change or should we? Psychological Bulletin, 74(1).

De Leeuw, J., & Verhelst, N. (1986). Maximum likelihood estimation in generalized Rasch models. Journal of Educational Statistics, 11.

Glas, C. A. W. (2005). Structural item response. In Encyclopedia of social measurement (Vol. 3). London: Elsevier Ltd.

Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. The Annals of Mathematical Statistics, 31(4).

Godambe, V. P., & Thompson, M. E. (1984). Robust estimation through estimating equations. Biometrika, 71(1).

Hambleton, R. K., & Swaminathan, H. (1985). Item Response Theory: Principles and applications. Boston: Kluwer-Nijhoff.

Hanson, B. A., & Béguin, A. A. (2002). Obtaining a common scale for Item Response Theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26(1).

Harris, D. J. (1991). Effects of passage and item scrambling on equating relationships. Applied Psychological Measurement, 15(3).

Kamata, A. (1998). Some generalizations of the Rasch model: An application of the hierarchical generalized linear model. Unpublished doctoral dissertation, Michigan State University, East Lansing.

Kamata, A. (2001). Item analysis by the Hierarchical Generalized Linear Model. Journal of Educational Measurement, 38(1).

Karkee, T., Lewis, D. M., Hoskens, M., & Yao, L. (2003). Separate versus concurrent calibration methods in vertical scaling. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.

Kim, S.-H., & Cohen, A. S. (1998). A comparison of linking and concurrent calibration under item response theory. Applied Psychological Measurement, 22(2).

Kish, L. (1965). Survey sampling. New York: John Wiley & Sons.

Kolen, M. J. (1981). Comparison of traditional and item response theory methods for equating tests. Journal of Educational Measurement, 18.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling and linking: Methods and practices. New York: Springer-Verlag.

Linn, R. L., & Slinde, J. A. (1977). The determination of the significance of change between pre and posttesting periods. Review of Educational Research, 47.

Lord, F. M. (1980). Application of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Mislevy, R. J. (1984). Estimating latent distributions. Psychometrika, 49.

Muraki, E., Hombo, C. M., & Lee, Y.-W. (2000). Equating and linking performance assessments. Applied Psychological Measurement, 24(4).

Patz, R. J., & Junker, B. W. (1999). Application and extension of MCMC in IRT: Multiple item types, missing data, and rated response. Journal of Educational and Behavioral Statistics, 24(4).

Peterson, N. S., Cook, L. L., & Stocking, M. L. (1983). IRT versus conventional equating methods: A comparative study of scale stability. Journal of Educational Statistics, 8(2).

Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2004). Generalized multilevel structural equation modelling. Psychometrika, 69(2).

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Denmarks Paedagogiske Institut.

Rogosa, D. R., Brandt, D., & Zimowski, M. (1982). A growth curve approach to the measurement of change. Psychological Bulletin, 92.

Sarndal, C. E., Swenson, B., & Wretman, J. (1992). Model assisted survey sampling. New York: Springer-Verlag.

Skaggs, G., & Lissitz, R. W. (1988). IRT test equating: Relevant issues and a review of recent research. Review of Educational Research, 56(4).

Skrondal, A., & Rabe-Hesketh, S. (2003). Multilevel logistic regression for polytomous data and rankings. Psychometrika, 68(2).
