A Unified Approach to Linear Equating for the Non-Equivalent Groups Design


Research Report RR-03-31

A Unified Approach to Linear Equating for the Non-Equivalent Groups Design

Alina A. von Davier and Nan Kong
Educational Testing Service, Princeton, NJ
Research & Development
November 2003

Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from: Research Publications Office, Mail Stop 7-R, Educational Testing Service, Princeton, NJ 08541

Abstract

This paper describes a new, unified framework for linear equating in a Non-Equivalent-groups Anchor Test (NEAT) design. We focus on three methods for linear equating in the NEAT design (Tucker, Levine observed-score, and chain) and develop a common parameterization that allows us to show that each particular equating method is a special case of the linear equating function in the NEAT design. We use a new concept, the Method Function, to distinguish among the linear equating functions, in general, and among the three equating methods, in particular. This approach leads to a general formula for the standard error of equating for all equating functions in the NEAT design. We also present a new tool, the standard error of equating difference, to investigate whether the observed difference in the equating functions is statistically significant.

Key words: Test equating, Non-Equivalent groups Anchor Test (NEAT) design, Tucker equating, Levine observed-score equating, chain linear equating, standard error of equating, delta method

Acknowledgments

The authors would like to thank Paul Holland, Neil Dorans, Hariharan Swaminathan, Dan Eignor, Shelby Haberman, Skip Livingston, and Krishna Tateneni for helpful comments and suggestions during the development of this project. We are also thankful to Bruce Kaplan and Ted Blew, who were very supportive in developing the software and carrying out the resampling procedure. We would also like to thank Kim Fryer, Elizabeth Brophy, and Diane Rein for their help in editing the manuscript. The Educational Testing Service Research Allocation supported our work. Any opinions expressed in this paper are those of the authors and not necessarily of Educational Testing Service.

Test equating methods are statistical tools used to produce exchangeable scores across different test forms. In particular, observed-score equating methods, as opposed to true-score equating methods, refer to the transformation of the raw scores of a new test, X, to the raw scores of an old test, Y. Any test equating process consists of a data collection design and different equating methods. This paper focuses on the Non-Equivalent-groups Anchor Test (NEAT) design and the linear equating function.

The NEAT design is a data collection design widely used in practice. It involves two populations of test-takers (usually different test administrations), P and Q, and a sample of examinees from each. The sample from P takes test X, the sample from Q takes test Y, and both samples take an anchor test, V, which is used to link X and Y. For the NEAT design, several observed-score equating methods are commonly used. Here we focus only on observed-score linear equating methods for the NEAT design.

In this paper, we take a new, mathematical approach to three linear observed-score equating methods (Tucker equating for the NEAT design with an anchor (T), Levine observed-score equating (L), and chain linear equating (CL)) by emphasizing their common framework and their similarities. We define these methods carefully later. In this way, we introduce a unified approach to linear equating in the NEAT design, and we show that each of these equating methods is a special case of the linear equating function. This approach allows us to establish new theoretical results on otherwise well-known equating methods, creating a conceptual shift in the analysis of the observed-score linear equating methods in the NEAT design: These are not disparate methods, each with its own framework; rather, they share the same parameter space and have numerous similarities.

One of the consequences is that we can develop one general formula for the standard error of equating (SEE) that is applicable to most of the available observed-score linear equating functions for the NEAT design. We also introduce in this paper a new, practical tool, the standard error of equating difference (SEED), for investigating whether the differences between the (linear) equating methods are statistically significant. More precisely, in this paper we investigate linear equating in the NEAT design from several points of view:

1. We put the three methods on a common footing by developing the same parameterization for the (equating) functions. We also make use of the concept of a Method Function in the framework of linear equating to show that there is only one definition of the linear observed-score equating function, which may have different special cases. (A related concept, the Design Function, was first introduced in von Davier, Holland, & Thayer, 2004, in a different context, to model the data collection design.) This approach leads to a general formula for the SEE for linear equating in the NEAT design in general and for the three equating functions in particular.
2. We generalize the SEED (von Davier et al., 2004) for any pair of (linear) equating functions that share the same set of parameters. The SEED is a new tool to investigate whether the difference between the equating functions is statistically significant.
3. We use real and resampled data from two national administrations of a high-volume testing program to illustrate the SEE (computed via the new general formula) and the SEED.

We do not make any distributional assumptions about the variables involved in this theoretical exposition.

Linear Equating Function for the NEAT Design

This section sets up the basic notation. We assume there are two tests to be equated, X and Y, and a target population, T, on which this is to be done (Braun & Holland, 1982; Kolen & Brennan, 1995). In this paper, we use the standard notation of $\mu$ and $\sigma^2$ for the means and the variances. We also use the symbol $\pi$ to denote the parameters in general. The subscripts usually indicate the variable and the population. We use $\sigma$ with appropriate subscripts to denote the covariances; for example, $\sigma_{X,V;P}$ denotes the covariance of X and V in P, while $\Sigma$ denotes the covariance matrix of $\pi$.

Many observed-score equating methods are based on the linear equating function. Usually, the rationale behind linear equating on the target population, T, is to set standardized deviation scores (z-scores) on the two forms to be equal such that

$$\frac{x - \mu_{XT}}{\sigma_{XT}} = \frac{y - \mu_{YT}}{\sigma_{YT}},$$

where $\mu_{YT}$, $\sigma_{YT}^2$, $\mu_{XT}$, and $\sigma_{XT}^2$ are the means and the variances of X and Y in T. Solving for y in the above equation results in the formula for the linear equating function,

$$\mathrm{Lin}_{XY;T}(x) = \mu_{YT} + \sigma_{YT}\,\frac{x - \mu_{XT}}{\sigma_{XT}}. \tag{1}$$

In the NEAT design, there is also another rationale, and implicitly another definition of a linear equating function: the chain linear equating function. The chain linear equating function is given by chaining together the two linear linking functions (i.e., by using the mathematical composition of the two linear functions), from X to V on P and from V to Y on Q, that is, $\mathrm{Lin}_{XV;P}(x)$ and $\mathrm{Lin}_{VY;Q}(v)$. This results in

$$\begin{aligned}
\mathrm{CL}_{XY}(x) &= \mathrm{Lin}_{VY;Q}\big(\mathrm{Lin}_{XV;P}(x)\big) \\
&= \mu_{YQ} + \frac{\sigma_{YQ}}{\sigma_{VQ}}\left(\Big(\mu_{VP} + \frac{\sigma_{VP}}{\sigma_{XP}}(x - \mu_{XP})\Big) - \mu_{VQ}\right) \\
&= \mu_{YQ} + \frac{\sigma_{YQ}}{\sigma_{VQ}}(\mu_{VP} - \mu_{VQ}) + \frac{\sigma_{YQ}}{\sigma_{VQ}}\,\frac{\sigma_{VP}}{\sigma_{XP}}\,(x - \mu_{XP}),
\end{aligned} \tag{2}$$

and (2) is the usual form for the chain linear equating function. Moreover, the final equating function does not depend on the target population, T. As shown in von Davier, Holland, and Thayer (in press), (2) can be rewritten as (1) under appropriate assumptions. This will be discussed in more detail later.

Usually, X and Y are the operational tests given to two samples from the two test administrations P and Q, respectively, and V is the anchor test given to both samples from P and Q. The anchor test score, V, can be either a part of both X and Y (called the internal anchor) or a separate score (the external anchor). In this study we assume that the target population, T, for the NEAT design is a mixture of P and Q and is denoted by

$$T = wP + (1 - w)Q, \tag{3}$$

(see Braun & Holland, 1982, or Kolen & Brennan, 1995, for details on the concept of a target population in the NEAT design). The target population in (3) is determined by a weight w. When w = 1, then T = P, and when w = 0, then T = Q. Other choices of w may be used as well. Typically, w is the ratio of the sample size of the group from P to the sum of the sample sizes of the two groups.

In the NEAT design, X and Y are each observed on only one of P and Q, never on both. Thus, X and Y are not both observed on T, regardless of the choice of w. For this reason, assumptions must be made in order to overcome this lack of complete information in the NEAT design. The three equating methods used in the NEAT design that concern us here, Tucker, Levine, and chain linear equating, make different assumptions about the distributions of X and Y in the populations where they are not observed. We identify these assumptions in the next section.

Tucker, Levine, and Chain Linear Equating Methods

In this section we briefly describe the methods we use and their assumptions, which can be found in more detail elsewhere (Kolen & Brennan, 1995; Angoff, 1984; von Davier et al., in press). Here we provide only the information that is necessary to explain our new approach, which is given in more detail later in this paper. This section is structured as follows: First, we present the Tucker and Levine methods together, stressing the similarities between them. Although the assumptions that underlie the two methods are different, the computational forms are similar. We will not give computational details on Tucker and Levine because they are well documented (Kolen & Brennan, 1995). Then, we describe the chain linear equating method, following the development given in von Davier et al. (in press). In the next section, we develop a common parameterization for the three functions that allows us to compare the equating functions as well as their standard errors (SEE).
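The definitions in (1), (2), and (3) can be made concrete with a minimal Python sketch. The function names and the moments used in the usage note are illustrative assumptions and do not come from the study; the sketch works with standard deviations rather than variances.

```python
def lin(x, mu_x, sd_x, mu_y, sd_y):
    """Linear equating as in (1): equal z-scores for X and Y on one population."""
    return mu_y + (sd_y / sd_x) * (x - mu_x)


def chain_linear(x, mu_xp, sd_xp, mu_vp, sd_vp, mu_yq, sd_yq, mu_vq, sd_vq):
    """Chain linear equating as in (2): link X to V on P, then V to Y on Q."""
    v = lin(x, mu_xp, sd_xp, mu_vp, sd_vp)      # Lin_{XV;P}(x)
    return lin(v, mu_vq, sd_vq, mu_yq, sd_yq)   # Lin_{VY;Q}(v)


def typical_weight(n_p, n_q):
    """The typical choice of w in (3): share of the P sample in the total."""
    return n_p / (n_p + n_q)
```

With hypothetical moments, `chain_linear(45, 40, 10, 20, 5, 38, 9, 19, 5)` agrees with the expanded last line of (2), and `typical_weight(n_p, 0)` returns 1, the case T = P.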

Tucker Equating Method: Assumptions

T1: The linear regressions of X on V and of Y on V are the same in the two populations.
T2: The conditional variances of X given V and of Y given V are the same in the two populations.

Levine Observed-score Equating Method: Assumptions

L1: X, Y, and V all measure the same thing, or, stated in different words, the true scores of the tests ($T_X$ and $T_Y$) and of the anchor ($T_V$) in the two populations are perfectly correlated.
L2: The regressions of $T_X$ on $T_V$ and of $T_Y$ on $T_V$ are linear and the same in the two populations.
L3: The measurement error variances for X and for Y are the same in the two populations.

From the two sets of assumptions and from (1), the formulas for the parameters of X and Y on T for Tucker and Levine follow. They are similar in form for the two equating methods:

$$\mu_{XT} = \mu_{XP} - (1 - w)\,\gamma_P\,(\mu_{VP} - \mu_{VQ}), \tag{4}$$

$$\mu_{YT} = \mu_{YQ} + w\,\gamma_Q\,(\mu_{VP} - \mu_{VQ}), \tag{5}$$

$$\sigma_{XT}^2 = \sigma_{XP}^2 - (1 - w)\,\gamma_P^2\,(\sigma_{VP}^2 - \sigma_{VQ}^2) + w(1 - w)\,\gamma_P^2\,(\mu_{VP} - \mu_{VQ})^2, \tag{6}$$

$$\sigma_{YT}^2 = \sigma_{YQ}^2 + w\,\gamma_Q^2\,(\sigma_{VP}^2 - \sigma_{VQ}^2) + w(1 - w)\,\gamma_Q^2\,(\mu_{VP} - \mu_{VQ})^2 \tag{7}$$

(see Kolen & Brennan, 1995, for the derivations). The four $\gamma$-parameters, which distinguish the two equating methods, Tucker and Levine, have the following formulas.

For the Tucker method:

$$\gamma_P = \alpha_P = \frac{\sigma_{X,V;P}}{\sigma_{VP}^2} \quad \text{and} \quad \gamma_Q = \alpha_Q = \frac{\sigma_{Y,V;Q}}{\sigma_{VQ}^2}, \tag{8}$$

where $\sigma_{X,V;P}$ denotes the covariance of X and V in P and $\sigma_{Y,V;Q}$ denotes the covariance of Y and V in Q.

For the Levine observed-score equating function for a NEAT design with an external anchor:

$$\gamma_P = \frac{\sigma_{XP}^2 + \sigma_{X,V;P}}{\sigma_{VP}^2 + \sigma_{X,V;P}} \quad \text{and} \quad \gamma_Q = \frac{\sigma_{YQ}^2 + \sigma_{Y,V;Q}}{\sigma_{VQ}^2 + \sigma_{Y,V;Q}}, \tag{9}$$

which are the formulas for the Levine function derived under the assumptions L1-L3 and the additional assumption of a congeneric model, for which the error variances are proportional to the effective test lengths (see Kolen & Brennan, 1995, p. 117).

For the Levine function for a NEAT design with an internal anchor:

$$\gamma_P = \frac{\sigma_{XP}^2}{\sigma_{X,V;P}} \quad \text{and} \quad \gamma_Q = \frac{\sigma_{YQ}^2}{\sigma_{Y,V;Q}}, \tag{10}$$

which are also derived under the additional assumption of a congeneric model (see Kolen & Brennan, 1995, p. 116).

Chain Linear Equating Method: Assumptions

C1: The (linear) linking function from X to V is the same in the two populations, P and Q.
C2: The (linear) linking function from V to Y is the same in the two populations, Q and P.

We follow the notation and approach to chain linear equating given in von Davier et al. (in press, Appendix A). We do not give any computational detail in this paper; instead we refer to that work and quote only those formulas from it that are necessary for our exposition here. As shown in von Davier et al. (in press, Appendix A), from C1 and C2 it follows that on a target population, T, as defined in (3), we have

$$\mu_{XT} = \mu_{XP} + \frac{\sigma_{XP}}{\sigma_{VP}}\,(\mu_{VT} - \mu_{VP}), \tag{11}$$

$$\sigma_{XT}^2 = \frac{\sigma_{VT}^2}{\sigma_{VP}^2}\,\sigma_{XP}^2, \tag{12}$$

$$\mu_{YT} = \mu_{YQ} + \frac{\sigma_{YQ}}{\sigma_{VQ}}\,(\mu_{VT} - \mu_{VQ}), \text{ and} \tag{13}$$

$$\sigma_{YT}^2 = \frac{\sigma_{VT}^2}{\sigma_{VQ}^2}\,\sigma_{YQ}^2. \tag{14}$$
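The formulas (4)-(14) can be read as three functions that map the moments observed in P and Q to the four moments on T. The Python sketch below is illustrative only: the function names, the ordering of the parameter tuple, and the numbers in the usage note are assumptions. The moments of V on T in `mf_chain` use the standard mean and variance of a two-component mixture, which is also an assumption about how $\mu_{VT}$ and $\sigma_{VT}^2$ would be computed.

```python
from math import sqrt

# Assumed ordering of the moments:
# (mu_XP, var_XP, mu_VP, var_VP, cov_XVP, mu_YQ, var_YQ, mu_VQ, var_VQ, cov_YVQ)


def _synthetic(pi, w, g_p, g_q):
    """Moments of X and Y on T from (4)-(7), given the gamma parameters."""
    mxp, vxp, mvp, vvp, _, myq, vyq, mvq, vvq, _ = pi
    dmu, dvar = mvp - mvq, vvp - vvq
    mu_xt = mxp - (1 - w) * g_p * dmu
    mu_yt = myq + w * g_q * dmu
    var_xt = vxp - (1 - w) * g_p**2 * dvar + w * (1 - w) * g_p**2 * dmu**2
    var_yt = vyq + w * g_q**2 * dvar + w * (1 - w) * g_q**2 * dmu**2
    return mu_xt, var_xt, mu_yt, var_yt


def mf_tucker(pi, w):
    """Tucker: gamma parameters from (8)."""
    return _synthetic(pi, w, pi[4] / pi[3], pi[9] / pi[8])


def mf_levine_external(pi, w):
    """Levine, external anchor: gamma parameters from (9)."""
    g_p = (pi[1] + pi[4]) / (pi[3] + pi[4])
    g_q = (pi[6] + pi[9]) / (pi[8] + pi[9])
    return _synthetic(pi, w, g_p, g_q)


def mf_chain(pi, w):
    """Chain linear: (11)-(14), with mixture moments of V on T (assumed)."""
    mxp, vxp, mvp, vvp, _, myq, vyq, mvq, vvq, _ = pi
    mu_vt = w * mvp + (1 - w) * mvq
    var_vt = w * vvp + (1 - w) * vvq + w * (1 - w) * (mvp - mvq) ** 2
    return (mxp + sqrt(vxp / vvp) * (mu_vt - mvp), var_vt / vvp * vxp,
            myq + sqrt(vyq / vvq) * (mu_vt - mvq), var_vt / vvq * vyq)


def lin_on_t(x, moments_on_t):
    """Linear equating (1) evaluated with the four moments on T."""
    mu_xt, var_xt, mu_yt, var_yt = moments_on_t
    return mu_yt + sqrt(var_yt / var_xt) * (x - mu_xt)
```

For any made-up parameter tuple, `lin_on_t(x, mf_chain(pi, w))` returns the same value for every w, which illustrates numerically that the chain linear function does not depend on the target population.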

Von Davier et al. (in press) show that, under the assumptions C1 and C2 made by chain equating, $\mathrm{CL}_{XY}(x)$ defined in (2) is, in fact, $\mathrm{Lin}_{XY;T}(x)$, as defined in (1). More precisely, that work shows that applying (11)-(14) to the chain linear function from (2) results in (1). The target population, T, cancels out of the composed function, $\mathrm{Lin}_{VY;Q}\big(\mathrm{Lin}_{XV;P}(x)\big)$. This provides a direct argument that chain linear equating is the linear observed-score equating on T with $\mu_{XT}$, $\mu_{YT}$, $\sigma_{XT}^2$, and $\sigma_{YT}^2$ given by the expressions in (11)-(14).

Identifying the Parameters of the Tucker, Levine, and Chain Linear Equating Functions

In this section, we introduce a common parameterization for the linear equating functions described above. We show that this approach leads to a unified framework for all the linear equating functions in the NEAT design. Consider the linear equating function (1) that equates X to Y on the target population, T, in the form of (3). This equating function depends on $\mu_{XT}$, $\sigma_{XT}^2$, $\mu_{YT}$, and $\sigma_{YT}^2$, which are parameters on the population T. We can express this dependence of the equating function on the target population parameters by using the notation

$$\mathrm{Lin}_{XY;T}\big(x;\ \mu_{XT}, \sigma_{XT}^2, \mu_{YT}, \sigma_{YT}^2\big) = \text{a generic linear equating function.} \tag{15}$$

In (4)-(14), we observe that the four parameters on T depend on the 10 means, variances, and covariances in the two populations, P and Q. Denote by $\pi$ the column vector of the 10 parameters from the two bivariate distributions, that is,

$$\pi = \big(\mu_{XP}, \sigma_{XP}^2, \mu_{VP}, \sigma_{VP}^2, \sigma_{X,V;P}, \mu_{YQ}, \sigma_{YQ}^2, \mu_{VQ}, \sigma_{VQ}^2, \sigma_{Y,V;Q}\big)^t. \tag{16}$$

We use a new concept, a function that maps the 10 parameters from the two populations, P and Q, into the four parameters on the population T. To preserve the similarities to von Davier et al. (2004) and to emphasize the similarities across the equating methods, we will call this function the Method Function (MF).

$$\mathrm{MF}(\pi) = \big(\mu_{XT}, \sigma_{XT}^2, \mu_{YT}, \sigma_{YT}^2\big)^t. \tag{17}$$

Now, we rewrite (15) as

$$\mathrm{Lin}_{XY;T}\big(x; \mathrm{MF}(\pi)\big) = \text{a linear equating function obtained through a specific MF,} \tag{18}$$

with $\pi$ defined in (16).

The previous section showed that all three linear equating functions, Tucker, Levine, and chain linear, can be expressed as (1). Thus, they can also be expressed as (18), in which the Method Function differs according to which equating method is used. For the Tucker method, the Method Function is described by the formulas (4)-(8). For the Levine method, the Method Function is given by (4)-(7) and (9) for an external anchor, and by (4)-(7) and (10) for an internal anchor. For the chain linear method, the Method Function is described by (11)-(14). Each Method Function is given in detail in the appendix in Table A1.

From (2) as well as from (11)-(14), we observe that the covariance between X and V on P and the covariance of Y and V on Q do not appear in the formulas of the chain linear equating function. Hence, the chain linear equating function depends only on eight parameters, while the Tucker and Levine functions depend on ten parameters. However, by using (15), (17), and (18), we can express the three linear equating functions as sharing the same parameter space. Note that the chain linear function implicitly depends on the covariances between the tests and the anchor only if, before computing the equating function, the two bivariate distributions of the tests and the anchor are presmoothed using, for example, log-linear models (see von Davier et al., 2004; Holland & Thayer, 2000).

Equating functions are estimated by substituting estimates of the population parameters in (18), that is,

$$\widehat{\mathrm{Lin}}_{XY;T}\big(x; \mathrm{MF}(\pi)\big) = \mathrm{Lin}_{XY;T}\big(x; \mathrm{MF}(\hat{\pi})\big), \tag{19}$$

where $\hat{\pi}$ denotes a sample estimate of $\pi$.

The uncertainty in $\mathrm{Lin}_{XY;T}\big(x; \mathrm{MF}(\hat{\pi})\big)$ derives from the uncertainty in the estimate of $\pi$. Because the samples are independently drawn from populations P and Q, the covariances

between each of the five parameters estimated from the population P and the five parameters estimated from the population Q are zero. Hence, the covariance matrix $\Sigma$ of the parameter $\pi$ for the three equating functions, Tucker, Levine, and chain linear, is

$$\Sigma = \begin{pmatrix} \Sigma_P & 0 \\ 0 & \Sigma_Q \end{pmatrix}, \tag{20}$$

where $\Sigma_P$ denotes the covariance matrix of the five parameters obtained from the population P and $\Sigma_Q$ denotes the covariance matrix of the five parameters obtained from the population Q. Also note that the Braun and Holland linear equating method for the NEAT design (Braun & Holland, 1982; Kolen & Brennan, 1995, p. 146) shares the same parameter vector $\pi$ and has the same covariance matrix of $\pi$, as in (20).

In this section, we introduced a common parameterization that can be used for most of the available observed-score linear equating functions in the NEAT design. We showed that one could write down the Method Function formulas for each of the three methods that we analyzed here, and we think that one could easily write the appropriate Method Function for any other observed-score linear equating function. However, the investigation of additional equating functions is beyond the scope of this study.

Standard Error of Equating

In this section, we show that using a common parameterization for all linear equating functions in a NEAT design leads to a general formula for the SEE. The delta method, a general method for approximating standard errors that is based on the Taylor expansion (Rao, 1965; Kendall & Stuart, 1977), is widely used for computing standard errors. Kolen (1985) and Hanson, Zeng, and Kolen (1993) used the delta method to compute the SEE for the Tucker method and the Levine method, respectively. Although we also use the delta method for computing the SEE, our approach differs from Kolen (1985) and Hanson et al. (1993) in the following sense: We provide a unified approach that, through the MF, includes not only the Tucker and the Levine methods, but also chain linear equating and other linear observed-score equating functions such as the Braun and Holland

linear equating method (Braun & Holland, 1982; Kolen & Brennan, 1995). In order to emphasize this unity, we focus on the matrix form of the SEEs, (21) below, rather than on the sum form, as did Kolen (1985) and Hanson et al. (1993). The approach presented here has similarities with the approach developed in von Davier et al. (2004).

Delta Method Applied to Linear Equating

We use the delta method to calculate the asymptotic variance, $\mathrm{Var}\big(\mathrm{Lin}_{XY;T}(x; \mathrm{MF}(\hat{\pi}))\big)$, whose square root is the SEE. From the delta method (Theorem A1 in the appendix), it follows that the asymptotic variance of a smooth function, f, that depends on the parameter vector, $\pi$, is

$$\mathrm{Var}\big(f(\hat{\pi})\big) = J_f(\pi)\,\Sigma(\pi)\,J_f(\pi)^t, \tag{21}$$

where $J_f(\pi)$ is the Jacobian (the matrix or vector of the first derivatives of the function f with respect to the components of $\pi$) computed at the estimated values of $\pi$ (see also von Davier et al., 2004; von Davier, 2001).

Let the parameter $\pi$ from (16) be the parameter vector described in Theorem A1 and let f be a linear equating function, $\mathrm{Lin}_{XY;T}\big(x; \mathrm{MF}(\pi)\big)$, given in (1). The Method Function can refer to any of the Tucker, Levine, and chain linear functions. The Jacobian of $\mathrm{Lin}_{XY;T}\big(x; \mathrm{MF}(\pi)\big)$ is, according to matrix differentiation theory and differentiability of composition of functions,

$$J_f = J_{\mathrm{Lin}}\,J_{\mathrm{MF}},$$

where $J_{\mathrm{Lin}}$ is the vector of the first derivatives of the function from (1) with respect to $(\mu_{XT}, \sigma_{XT}^2, \mu_{YT}, \sigma_{YT}^2)$, and $J_{\mathrm{MF}}$ is the matrix of the first derivatives of $(\mu_{XT}, \sigma_{XT}^2, \mu_{YT}, \sigma_{YT}^2)$ with respect to the components of $\pi$ from (16).

In the previous section we showed that the (10 by 10) covariance matrix $\Sigma$ is the same for the Tucker, Levine, and chain linear functions. Moreover, the Jacobian $J_{\mathrm{Lin}}$ will also have the same form for all observed-score linear equating functions (the Jacobian of the linear function, for any of the three equating functions, is a 4-dimensional row vector).
The Jacobian $J_{\mathrm{MF}}$ is a 4 by 10 matrix and will have a different form for each of the equating methods.
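The delta-method variance in (21) can be sketched numerically. The Python sketch below is an illustration under stated assumptions, not the implementation used for this report: it approximates the row vector $J_f$ by central finite differences of the composed function (by the chain rule this is the same vector as the product $J_{\mathrm{Lin}} J_{\mathrm{MF}}$) instead of using analytic derivatives, and the names `numerical_gradient`, `delta_variance`, `see`, and `chain_linear`, as well as the ordering of the parameter vector, are hypothetical.

```python
from math import sqrt


def numerical_gradient(f, pi, eps=1e-6):
    """Central-difference approximation of the row vector J_f at pi."""
    grad = []
    for i in range(len(pi)):
        step = eps * max(1.0, abs(pi[i]))
        up, down = list(pi), list(pi)
        up[i] += step
        down[i] -= step
        grad.append((f(up) - f(down)) / (2 * step))
    return grad


def delta_variance(grad, sigma):
    """Quadratic form J Sigma J^t from (21) for a row vector J."""
    n = len(grad)
    return sum(grad[i] * sigma[i][j] * grad[j]
               for i in range(n) for j in range(n))


def see(equate, x, pi_hat, sigma_hat):
    """Standard error at score x of an equating function equate(x, pi)."""
    grad = numerical_gradient(lambda p: equate(x, p), pi_hat)
    return sqrt(delta_variance(grad, sigma_hat))


def chain_linear(x, pi):
    """Chain linear function (2) written on the 10-parameter vector.

    Assumed ordering: (mu_XP, var_XP, mu_VP, var_VP, cov_XVP,
                       mu_YQ, var_YQ, mu_VQ, var_VQ, cov_YVQ).
    """
    mu_xp, var_xp, mu_vp, var_vp, _, mu_yq, var_yq, mu_vq, var_vq, _ = pi
    v = mu_vp + sqrt(var_vp / var_xp) * (x - mu_xp)
    return mu_yq + sqrt(var_yq / var_vq) * (v - mu_vq)
```

Calling `see(chain_linear, x, pi_hat, sigma_hat)` with an estimated parameter vector and its block-diagonal covariance matrix gives one SEE value per score x. The two covariance entries of the gradient are zero for the chain linear function, since it does not use them.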

Now, by using (21), the SEE of a linear equating function, $\mathrm{Lin}_{XY;T}\big(x; \mathrm{MF}(\hat{\pi})\big)$, can be expressed as

$$\widehat{\mathrm{SEE}}(x) = \sqrt{\hat{J}_{\mathrm{Lin}}\,\hat{J}_{\mathrm{MF}}\,\Sigma\,\hat{J}_{\mathrm{MF}}^{\,t}\,\hat{J}_{\mathrm{Lin}}^{\,t}}, \tag{22}$$

with $\Sigma$ from (20). Equation (22) is the computational formula for the SEE for the Tucker, Levine, and chain linear methods, that is, the formula that might be implemented in a computer program. It is easy to see the computational advantages of having only one formula for the SEE for all linear equating methods. Note that this formula does not require any distributional assumption on the variables involved.

The entries of $\Sigma$ in (22) can be obtained from Kolen. The derivatives $J_f = J_{\mathrm{Lin}} J_{\mathrm{MF}}$ for the Tucker equating function are given in Kolen (1985). The derivatives $J_f = J_{\mathrm{Lin}} J_{\mathrm{MF}}$ for the Levine function for a NEAT design are given in Hanson et al. (1993). The derivatives $J_f = J_{\mathrm{Lin}} J_{\mathrm{MF}}$ for chain linear equating, given in Table A2, were computed by us. We use the notations $\mathrm{SEE}_T$, $\mathrm{SEE}_L$, and $\mathrm{SEE}_{CL}$ to refer to the SEE for the Tucker, Levine, and chain linear methods, respectively.

SEED for Linear Equating Functions

In this section, we state a new result that is analogous to (21) and that will allow us to compute a standard error for the difference between two linear equating functions. This standard error can be used to inform discussion about the final form of an equating function. The SEED was first introduced in von Davier et al. (2004) for the kernel method of test equating. This paper applies the same concept to the linear equating functions. The main differences between the SEED in von Davier et al. (2004) and the SEED here lie in the fact that the parameters of the equating functions and the equating functions themselves differ. In the kernel method of test equating, the parameters are the score probabilities of the tests to be equated (and, in chain equipercentile, also the score probabilities of the anchor test); in the case of linear equating, the parameters are the means, the variances, and the covariances of the tests to be equated and of the anchor test in the two populations, P and Q.

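Consider two equating functions, $\mathrm{Lin}\big(x; \mathrm{MF}_1(\pi)\big)$ and $\mathrm{Lin}\big(x; \mathrm{MF}_2(\pi)\big)$, which have the form given in (1) and depend on the same parameter vector from (16) (i.e., the assumptions on the functions required by the delta method are met). We are interested in

$$\mathrm{Var}\Big(\mathrm{Lin}\big(x; \mathrm{MF}_1(\hat{\pi})\big) - \mathrm{Lin}\big(x; \mathrm{MF}_2(\hat{\pi})\big)\Big). \tag{23}$$

Theorem 1. If $\mathrm{Lin}\big(x; \mathrm{MF}_1(\pi)\big)$ and $\mathrm{Lin}\big(x; \mathrm{MF}_2(\pi)\big)$ are two equating functions that have the form given in (1) and depend on the same parameter vector, $\pi$, from (16), then

$$\mathrm{Var}\Big(\mathrm{Lin}\big(x; \mathrm{MF}_1(\hat{\pi})\big) - \mathrm{Lin}\big(x; \mathrm{MF}_2(\hat{\pi})\big)\Big) = \hat{J}_{\mathrm{Lin}}\big(\hat{J}_{\mathrm{MF}_1} - \hat{J}_{\mathrm{MF}_2}\big)\,\Sigma\,\big(\hat{J}_{\mathrm{MF}_1} - \hat{J}_{\mathrm{MF}_2}\big)^t \hat{J}_{\mathrm{Lin}}^{\,t}, \tag{24}$$

where $J_{\mathrm{Lin}}$ is the 4-dimensional row vector of the first derivatives of the function from (1) with respect to the parameters on T, $(\mu_{XT}, \sigma_{XT}^2, \mu_{YT}, \sigma_{YT}^2)$; $J_{\mathrm{MF}}$ is the 4 by 10 matrix of the first derivatives of the four components of the Method Function, $(\mu_{XT}, \sigma_{XT}^2, \mu_{YT}, \sigma_{YT}^2)$, with respect to the components of $\pi$; and $\Sigma$ is the variance-covariance matrix of $\pi$, given in (20).

The proof follows from the delta method (Theorem A1), applied to the difference of two smooth functions that depend on the same parameters (see also von Davier et al., 2004, chapter 5). Hence, the SEED is

$$\mathrm{SEED} = \sqrt{\mathrm{Var}\Big(\mathrm{Lin}\big(x; \mathrm{MF}_1(\hat{\pi})\big) - \mathrm{Lin}\big(x; \mathrm{MF}_2(\hat{\pi})\big)\Big)}. \tag{25}$$

Corollary 1. The SEEDs for any pair of the three equating functions, Tucker, Levine, and chain linear, are:

$$\mathrm{SEED}_{T,L} = \sqrt{\hat{J}_{\mathrm{Lin}}\big(\hat{J}_T - \hat{J}_L\big)\,\Sigma\,\big(\hat{J}_T - \hat{J}_L\big)^t \hat{J}_{\mathrm{Lin}}^{\,t}}, \tag{26}$$

$$\mathrm{SEED}_{CL,L} = \sqrt{\hat{J}_{\mathrm{Lin}}\big(\hat{J}_{CL} - \hat{J}_L\big)\,\Sigma\,\big(\hat{J}_{CL} - \hat{J}_L\big)^t \hat{J}_{\mathrm{Lin}}^{\,t}}, \tag{27}$$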

$$\mathrm{SEED}_{T,CL} = \sqrt{\hat{J}_{\mathrm{Lin}}\big(\hat{J}_T - \hat{J}_{CL}\big)\,\Sigma\,\big(\hat{J}_T - \hat{J}_{CL}\big)^t \hat{J}_{\mathrm{Lin}}^{\,t}}, \tag{28}$$

with $\Sigma$ from (20).

The proof follows from Theorem 1. The entries of $J_{\mathrm{Lin}} J_T$ are given in Kolen (1985). The entries of $J_{\mathrm{Lin}} J_L$ are given in Hanson et al. (1993), and the entries of $J_{\mathrm{Lin}} J_{CL}$ are given in Table A2.

In conclusion, the SEED is a measure of the uncertainty in the difference between two equating functions that is due to the estimation of the parameters (the means, variances, and covariances in the two samples). It also reflects the differences in the two Method Functions. We propose the following practical rule: If the difference between two linear equating functions is no larger than the noise level in the data, then this difference would be smaller than twice the SEED in either direction (see also von Davier et al., 2004).

Study 1

Here we illustrate how the general formula for the SEE and a new tool, the SEED, for the Tucker, Levine, and chain linear methods can be applied, using an example that involves data from two national administrations of a high-volume testing program. The two testing administrations were in the fall of 2001 (P) and in the winter of 2000 (Q). We consider this example to be an informative one, in the sense that it departs from the ideal conditions described in von Davier (2003), under which the equating methods give the same results. Moreover, as seen later, the difference between the three equating functions of interest is about half a score point or more, which is a difference that matters for the program from which the data come. (A difference in equating results that is large enough to make a difference in the reported scores is called a difference that matters.)

The data, which were collected following a NEAT design with an external anchor, consisted of the raw sample frequencies of rounded formula scores for two parallel, 78-item tests and a 35-item external anchor test given to two samples from a national population of examinees.
(The rounded formula scores are scores in which the right-minus-a-quarter-wrong formula scores are rounded to integers.) In this study, the negative scores were rounded to zero.
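The SEED of Theorem 1 and the practical rule based on twice the SEED can be sketched in Python. This is a hedged illustration rather than the software used in this study: it applies the delta method to the difference of two equating functions, as in the proof of Theorem 1, with a finite-difference gradient standing in for the analytic derivatives. The names `seed` and `differs_beyond_noise` are hypothetical, and each equating function is assumed to have the signature `equate(x, pi)`.

```python
from math import sqrt


def numerical_gradient(f, pi, eps=1e-6):
    """Central-difference approximation of the gradient of f at pi."""
    grad = []
    for i in range(len(pi)):
        step = eps * max(1.0, abs(pi[i]))
        up, down = list(pi), list(pi)
        up[i] += step
        down[i] -= step
        grad.append((f(up) - f(down)) / (2 * step))
    return grad


def seed(equate_1, equate_2, x, pi_hat, sigma_hat):
    """Standard error of the difference equate_1 - equate_2 at score x."""
    def difference(p):
        return equate_1(x, p) - equate_2(x, p)

    grad = numerical_gradient(difference, pi_hat)
    n = len(grad)
    variance = sum(grad[i] * sigma_hat[i][j] * grad[j]
                   for i in range(n) for j in range(n))
    return sqrt(variance)


def differs_beyond_noise(difference, seed_value):
    """Practical rule: flag a difference outside the band of +/- 2 SEED."""
    return abs(difference) > 2.0 * seed_value
```

In use, `seed` would be evaluated at every score x for a pair such as the Tucker and chain linear functions, and `differs_beyond_noise` applied to the observed difference at that score.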

The data are sample frequencies for two bivariate distributions. We denote the two sets of sample frequencies by $n_{jl}$ = number of examinees with $X = x_j$ and $V = v_l$, and $m_{kl}$ = number of examinees with $Y = y_k$ and $V = v_l$. In this example, $x_1 = 0, x_2 = 1, \ldots, x_{79} = 78$; the same is true for y. For v, we have $v_1 = 0, v_2 = 1, \ldots, v_{36} = 35$. The two sample sizes are given by: N =10,634 and M =11,31. The sample correlation of X and V in P was 0.88, and the sample correlation of Y and V in Q was

[Table 1. Summary Statistics for the Observed Distributions of X, Y, V in P, and V in Q. Columns: X, Y, V in P, V in Q. Rows: Mean, SD.]

From Table 1 we see that the mean of the anchor test V is (±0.08) in population P, and (±0.08) in Q, where 0.08 is the standard error of the mean. Thus Q is a less proficient population than P, as measured by V. In terms of effect sizes, the difference between these two means (.66) is approximately 3% of the average standard deviation of 8.7. For this type of testing program, a mean difference of this magnitude indicates a fairly large difference between the two populations.

Before chain linear equating was in use, ETS researchers were guided by the following rules when they had to choose between Tucker and Levine equating: If the standardized mean difference of the anchor scores in the two samples is smaller than 0.5, then choose the Tucker method; and if the ratio of the variances of the anchor in the two samples is between 0.80 and 1.5, then use the Tucker method (Kirk, 1971; Wichert, 1967). We could not find any rational explanation for these rules, especially for the cut-off values. Kolen and Brennan (1995), however, suggest choosing Levine when it is known that populations differ substantially and if there is also reason to believe that the forms are quite similar and choosing Tucker if the forms are suspected to differ, with the observation that if the populations [and

forms] are too dissimilar, then any equating is suspect and with the note that this ad hoc reasoning is by no means definitive. Hence, based on this information, one would have chosen the Levine equating function for this example, since the test forms are very carefully constructed to be parallel in this assessment program.

We used the formulas (1) and (4)-(10) to compute the Tucker and Levine functions. We used (2) to compute the chain linear equating function. The equating functions, the SEEs, and the SEEDs are discrete functions of x. The three functions, shown in Figure 1, give relatively different results. The differences between the Tucker and Levine functions and between the Tucker and chain functions are more than half a raw score point over the whole score range, which is a difference that matters. The difference between the Levine function and the chain function is less than the size of a difference that matters over the whole score range.

[Figure 1. The Tucker, Levine, and chain linear functions. Study 1. NEAT design with an external anchor. Plotted: equating differences T-L, T-C, and L-C against X-score.]

The three SEEs are given in Figure 2. The shape of the SEEs is the usual one for linear equating functions, with lower values around the means and higher values for the extreme score ranges. The SEE for the Levine function seems to be larger than the SEE for the chain linear function and for the Tucker function, which has the smallest values over almost the whole score range (see Figure 2). The SEEs for the three functions are very close to each other, though, and therefore one could not choose a method based solely on these SEE values.

[Figure 2. The SEE T, SEE L, and SEE C. Study 1. NEAT design with an external anchor. Plotted: SEE for T, L, and C against X-score.]

We then used (24)-(28) to compute the SEED for each pair of linear equating functions (Figure 3). From the results plotted in Figure 3, we might conclude that the accuracy of the difference between the Levine and chain linear functions is very high in the middle of the score range, but relatively low for the lower and upper score range. In contrast, the accuracy of the difference between the Tucker and chain functions and between the Tucker and Levine functions varies less across the score range, being relatively high in the middle of the score range. Since the accuracy of estimating the parameter $\pi$ is the same for the three equating functions and the vector of the first derivatives of the linear function is also the same for the three equating functions, this plot reflects the differences in the pairs of the Method Functions (more exactly, the differences in the first derivatives of these functions); see also Corollary 1.

[Figure 3. The standard error of equating differences for three equating functions. Study 1. NEAT design with an external anchor. Plotted: SEED for T-L, T-C, and L-C against X-score.]

Figures 4-6 plot the difference between two linear equating functions together with the corresponding ±2 SEED. In these three cases, the differences between the three functions (about half of a raw score point or more; see also Figure 1) are statistically significant relative to the SEEDs. It appears that the Levine and the chain functions agree only at the very low end of the score range. As mentioned before, the SEEDs reflect the uncertainty in these differences that is due to the estimation of the parameters (the means, the variances, and the covariances in the two samples) as well as to the differences in the Method Functions.

Figure 4. The difference between Tucker and Levine together with a band of ±SEED(T,L). Study 1. NEAT design with an external anchor.

Figure 5. The difference between Tucker and chain linear together with a band of ±SEED(T,C). Study 1. NEAT design with an external anchor.

Figure 6. The difference between Levine and chain linear functions together with a band of ±SEED(L,C). Study 1. NEAT design with an external anchor.

In conclusion, we observe that the differences between the three equating functions are statistically significant. In other words, with the help of the SEED, we can distinguish between noise and real differences between the analyzed functions.

Study 2

The SEEDs are asymptotic results, so it is of interest to investigate how they vary with sample size. Sample sizes of about 10,000 are relatively large, and therefore the estimation of the parameters is relatively accurate. As a consequence, the ±SEED band will be very narrow. Study 2 examined the following research questions:

What is going to happen to the SEED when the sample sizes get smaller?

For which N will the ±SEED band be about half of a raw score point (a difference that matters)?
For which N will the ±SEED band encompass the difference between the equating functions? More precisely, for which N will the SEED not be able to detect that the equating functions differ statistically?

We drew seven samples, of sizes 5,000; 2,500; 1,700; 800; 400; 200; and 100, for each group of students from P (those who took (X, V)) and Q (those who took (Y, V)), respectively. These samples were independent random samples, drawn without replacement from the original N = 10,634 for (X, V) and M = 11,31 for (Y, V). Sorted, uniformly distributed random numbers between 0 and 1 (including 0 and 1) were generated in Microsoft Excel using the RAND function. For each sample size within each population, the sampling steps were as follows:

1) Assign a random number from the uniform distribution between 0 and 1 to each case (person) in the group. There are N cases in the first group.
2) Sort these N random numbers.
3) Include the first N_S cases in the new sample (where N_S is the size of the new sample and N_S is less than or equal to N).

We repeated the same procedure for the second group. The summary statistics for X, Y, and V in P and Q in the new samples are given in Tables 2a and 2b.
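The three sampling steps described above can be sketched in Python. This is a minimal illustration, not the spreadsheet procedure the authors used: the function name draw_subsample, the toy score pairs, and the seed are assumptions, and Python's random module stands in for the Excel RAND function.

```python
import random


def draw_subsample(cases, n_s, seed=None):
    """Draw n_s cases without replacement by sorting on uniform random keys."""
    if n_s > len(cases):
        raise ValueError("n_s must not exceed the number of cases")
    rng = random.Random(seed)
    # Step 1: assign a uniform random number to each case.
    keyed = [(rng.random(), i) for i in range(len(cases))]
    # Step 2: sort the random numbers (ties, if any, are broken by case index).
    keyed.sort()
    # Step 3: keep the cases attached to the first n_s sorted numbers.
    return [cases[i] for _, i in keyed[:n_s]]


# Hypothetical (X, V) score pairs for ten examinees in P.
group_p = [(x, x // 2) for x in range(10)]
sample_p = draw_subsample(group_p, n_s=4, seed=1)
print(sample_p)
```

Calling the function separately for P and Q, once per sample size, mirrors the description of independent samples drawn without replacement from each group.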

Table 2a. Summary Statistics for the Distributions of X, Y, V in P and V in Q, for the Samples With Different Sample Sizes, N_S and M_S
(Columns: N = 10,634 and M = 11,31; N_S = M_S = 5,000; N_S = M_S = 2,500; N_S = M_S = 1,700. Rows: the means µ_XP, µ_VP, µ_YQ, and µ_VQ, with the accompanying spread statistics for X and V in P and for Y and V in Q, and the joint (X, V) and (Y, V) statistics. Cell values are not available.)

Moreover, we took care to preserve the same sign as that for the differences of the means in the two samples. We also took care to approximately preserve the same effect sizes (with respect to the difference in ability in the two populations as measured by the anchor) across the samples. For example, we drew a second set of samples of size 100 in order to preserve the same sign for the differences of the means in the two samples. It is important to note that, although the resampling was carefully carried out, with smaller samples the parameter estimates will fluctuate around the values in the original samples. This sampling error will also have an effect on the computation of the equating functions and of their differences.

Table 2b. Summary Statistics for the Distributions of X, Y, V in P and V in Q, for the Samples With Different Sample Sizes, N_S and M_S
(Columns: N_S = M_S = 800; N_S = M_S = 400; N_S = M_S = 200; N_S = M_S = 100. Rows: as in Table 2a. Cell values are not available.)

Figures 7 to 13 plot only the differences between the Tucker and the chain linear equating functions together with ±SEED(T,C). Study 2 focuses on the SEED's behavior for small and medium sample sizes, and for this purpose it does not matter which pair of functions we focus on. The results for the differences between the Tucker and Levine functions are similar to the results for Tucker and chain linear, while the results for the differences between chain linear and Levine are as in Figure 6. Each figure illustrates one sample size, with N_S = M_S = 100, 200, 400, 800, 1,700, 2,500, and 5,000, respectively.

We notice that when the sample sizes are small, the uncertainty related to computing the equating functions is large relative to the difference between the two functions (compare the SEED in the original sample, Figures 3 and 5, with the SEED for N_S = M_S = 100 in Figure 7). Hence, with a sample size of 100 available, we would conclude that the differences between the two equating functions are not statistically significant. Moreover, the ±SEED(T,C), in absolute value, is larger than a difference that matters.

Figure 7. The difference between Tucker and chain linear functions together with a band of ±SEED(T,C). Study 2, N = 100. NEAT design with an external anchor.

For a sample size of 200, the differences between the Tucker and chain functions are statistically significant and the ±SEED(T,C) is about the size of a difference that matters (see Figure 8). However, in the lower and upper score ranges, the difference between the two equating functions is inside the band provided by the ±SEED(T,C). One of the reasons is that the accuracy is lower at the extremes of the score range.

Figure 8. The difference between Tucker and chain linear functions together with a band of ±SEED(T,C). Study 2, N = 200. NEAT design with an external anchor.

For a sample size of 400, the differences between the Tucker and chain functions are statistically significant over most of the score range, and the SEED(T,C) is about the size of a difference that matters (see Figure 9).

Figure 9. The difference between Tucker and chain linear functions together with a band of ±SEED(T,C). Study 2, N = 400. NEAT design with an external anchor.

For all the larger samples, the differences between the Tucker and chain functions are statistically significant over all of the score range (see Figures 10–13).

Figure 10. The difference between Tucker and chain linear functions together with a band of ±SEED(T,C). Study 2, N = 800. NEAT design with an external anchor.

Figure 11. The difference between Tucker and chain linear functions together with a band of ±SEED(T,C). Study 2, N = 1,700. NEAT design with an external anchor.

Figure 12. The difference between Tucker and chain linear functions together with a band of ±SEED(T,C). Study 2, N = 2,500. NEAT design with an external anchor.

Figure 13. The difference between Tucker and chain linear functions together with a band of ±SEED(T,C). Study 2, N = 5,000. NEAT design with an external anchor.

With a sample size of 200 available, we conclude that the two equating functions differ significantly over most of the score range. For larger sample sizes, we notice that the accuracy increases (i.e., the SEED(T,C) decreases in absolute value), and one can conclude that the differences between the two functions are statistically significant. It follows that, for this data set, a sample size of 200 seems to be enough for the SEED to detect that the two equating methods, Tucker and chain, differ statistically, although the level of accuracy is slightly lower at this small sample size. More studies are necessary to investigate the SEED's behavior in small and medium samples. Given that in most practical equating situations the sample sizes are much larger, the SEED will probably detect whether the differences between the equating methods are significant.
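A rough way to anticipate how the band widens as N shrinks is to assume that the SEED, being an asymptotic standard error, scales approximately with 1/sqrt(N) when both samples shrink together. The sketch below rests entirely on that assumption; the function names and the reference values are hypothetical and are not estimates from this study.

```python
import math


def seed_at_n(seed_ref, n_ref, n_new):
    """Extrapolate a SEED observed at sample size n_ref to sample size n_new."""
    return seed_ref * math.sqrt(n_ref / n_new)


def min_n_for_band(seed_ref, n_ref, half_width, k=2.0):
    """Smallest N with k * SEED <= half_width under 1/sqrt(N) scaling."""
    return math.ceil(n_ref * (k * seed_ref / half_width) ** 2)


# Hypothetical reference point: SEED = 0.125 at one score, observed at N = 10,000.
seed_small = seed_at_n(0.125, 10_000, 2_500)
n_needed = min_n_for_band(0.125, 10_000, half_width=0.5)
print(seed_small, n_needed)
```

Such an extrapolation applies to a single score point and ignores the fluctuation of the parameter estimates in small samples, so it can only suggest which sample sizes are worth examining empirically, as was done in Figures 7 to 13.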

In von Davier (2003), several idealized conditions are described under which the three methods give the same results. In practical applications, however, each of these conditions holds only approximately. In a real-life situation, when plots like those in Figures 4–6 indicate that the differences between the methods are statistically significant, which of the methods should one choose?

From a score-reporting point of view, it does matter which method one chooses in this example, because the differences between the results from the Tucker method and the others do have an impact on the final results. (These differences are larger than half a raw score point for most of the raw-score range of X.) From Study 1, we can conclude that Tucker is far away from the other two equating methods and that the chain linear is in between Tucker and Levine, $\mathrm{Lin}_{XY;T}(x; mf_{T}(\pi)) < \mathrm{Lin}_{XY;T}(x; mf_{CL}(\pi)) < \mathrm{Lin}_{XY;T}(x; mf_{L}(\pi))$. Moreover, all observed differences are statistically significant.

We cannot make the decision about the final equating function using the SEED alone if each of the equating methods relies on a different set of assumptions. We also cannot resolve the choice between the methods by directly checking their assumptions (T1–T2, L1–L3, C1–C3) against the data, since these assumptions are not directly testable. In a practical situation, one will also investigate the issues related to the possible nonlinearity of the appropriate equating function. In addition, one should investigate the SEE for each equating function. The equating results with a higher accuracy (smaller SEEs) should prevail. However, in Study 1, the differences in the SEEs were very small, and therefore it would be difficult to use them for making the decision.

As mentioned before, in this example the two populations seem to be dissimilar. Using the rules and discussion previously presented, one would choose the Levine equating function (when choosing between the Tucker and Levine methods), since the test forms are usually very carefully constructed in this assessment program. Hence, the final decision would appear to be between the Levine and the chain functions. At this point, one's belief in the plausibility of each set of assumptions appears to be the sole basis left for making this important judgment (von Davier et al., 2004, p. 194). Further research in this area is necessary. The advantages of the SEED are outlined in the next section.

Discussion

This paper takes a new perspective on linear equating. It introduces a unified approach to linear equating in the NEAT design by developing a common parameterization that allows one to emphasize the similarities between different methods. Based on this common parameterization, we claim that there is only one definition of observed-score linear equating in the NEAT design, given in (1), which might take different forms under different assumptions. We use a new concept, the Method Function, to distinguish among the possible forms that a linear equating function in a NEAT design might take (in particular, among the three equating methods investigated here: Tucker, Levine, and chain linear equating). With this approach, the SEE formula and concept also become unified, covering all of the particular equating functions. The new approach to linear equating provides a better understanding of equating in general as well as of the SEE. To our knowledge, this view is provided here for the first time. Because a single SEE formula covers all of the methods, it can also be implemented more efficiently in a computer program.

We also present a new tool, the standard error of equating difference (SEED), to investigate whether the observed difference between equating functions is statistically significant. Although the SEED is an asymptotic result, it seems to be stable enough to detect the differences at a sample size of 200 for the data investigated here. Additional studies might be necessary to describe the behavior of the SEED for small and medium sample sizes with different data.

The SEED provides an additional measure to consider when making decisions about the final equating function, especially for medium sample sizes. It is important to know whether the observed differences between two equating functions are statistically significant or reflect only random error. This issue was extensively investigated in empirical studies, and as Harris and Crouse (1993, p. 19) conclude:

Perhaps the most common process followed in conducting an equating study is to apply a series of equating methods to a particular situation. Usually all that can be concluded from such a comparison is whether the methods appear to be providing similar or dissimilar results, and even that cannot be determined with any accuracy, because one generally does not have a baseline by which to judge if the differences between results are simply the result of random error, or something else.

The SEED is exactly the answer to the second part of Harris and Crouse's remark: The SEED can tell whether the observed differences are the result of random error or not. While it does not solve the problem of how to decide between different equating functions, it is a step forward in providing more insight and information that one can use when making this decision. Harris and Crouse (1993) reviewed the criteria and methods that researchers had developed up to that time for improving this decisional process. Three other methods can be considered:

1. Investigating how sensitive each of the equating functions is to the population invariance assumption (see Dorans & Holland, 2000; von Davier et al., 2003). The method introduced in von Davier et al. (2003), though promising, needs additional research.
2. Carrying out a score equity analysis as proposed in Dorans (2003). This is also an approach to the study of population invariance, but it focuses on different issues: specifying the number of subpopulations that should be investigated, checking whether the subpopulation score distributions are similar, computing the standardized difference between the means in the important subpopulations, and using the Dorans and Holland (2000) measure to investigate the population invariance of the equating function.
3. Comparing the first several moments of the distribution obtained through equating with those of the distribution of the old form (the targeted distribution; see von Davier et al., 2004, chapter 4).

It is also worth noting that an approach similar to the one outlined here (with general formulas for the SEE and SEED) is being developed to investigate the differences between linear and nonlinear equating functions in the framework of the kernel method of test equating (see von Davier et al., 2004). A similar SEED formula is not feasible for classical equipercentile equating (which uses linear interpolation as a continuization procedure) because the resulting equating function is not continuously differentiable at the end points of the linear segments, and therefore the delta method cannot be applied. A bootstrap SEED might be conceived for this situation, which would be a very interesting issue for further research.
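To make the idea of a bootstrap SEED concrete, the sketch below resamples each group with replacement and takes the standard deviation of the difference between two equating functions at one score point. It is a minimal illustration under stated assumptions, not a procedure from this report: the two functions are the standard textbook forms of Tucker and chained linear equating (used here only for brevity, since the same loop would accept any pair of equating functions), the Tucker synthetic-population weights are taken proportional to sample size, and the data are simulated.

```python
import math
import random
import statistics


def moments(pairs):
    """Means, variances and covariance of (test, anchor) score pairs."""
    n = len(pairs)
    mt = sum(t for t, _ in pairs) / n
    mv = sum(v for _, v in pairs) / n
    vt = sum((t - mt) ** 2 for t, _ in pairs) / (n - 1)
    vv = sum((v - mv) ** 2 for _, v in pairs) / (n - 1)
    cov = sum((t - mt) * (v - mv) for t, v in pairs) / (n - 1)
    return mt, mv, vt, vv, cov


def chain_linear(x, p, q):
    """Chained linear equating: X to V on P, then V to Y on Q."""
    mx, mvp, vx, vvp, _ = moments(p)
    my, mvq, vy, vvq, _ = moments(q)
    v = mvp + math.sqrt(vvp / vx) * (x - mx)
    return my + math.sqrt(vy / vvq) * (v - mvq)


def tucker(x, p, q):
    """Tucker linear equating; weights proportional to sample size (assumed)."""
    mx, mvp, vx, vvp, cxv = moments(p)
    my, mvq, vy, vvq, cyv = moments(q)
    w1 = len(p) / (len(p) + len(q))
    w2 = 1.0 - w1
    g1, g2 = cxv / vvp, cyv / vvq            # regression slopes on the anchor
    dm, dv = mvp - mvq, vvp - vvq            # anchor mean and variance gaps
    ms_x = mx - w2 * g1 * dm
    ms_y = my + w1 * g2 * dm
    vs_x = vx - w2 * g1 ** 2 * dv + w1 * w2 * g1 ** 2 * dm ** 2
    vs_y = vy + w1 * g2 ** 2 * dv + w1 * w2 * g2 ** 2 * dm ** 2
    return ms_y + math.sqrt(vs_y / vs_x) * (x - ms_x)


def bootstrap_seed(x, p, q, f1, f2, reps=200, seed=0):
    """Bootstrap standard error of f1(x) - f2(x), resampling P and Q separately."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        bp = rng.choices(p, k=len(p))        # resample group P with replacement
        bq = rng.choices(q, k=len(q))        # resample group Q with replacement
        diffs.append(f1(x, bp, bq) - f2(x, bp, bq))
    return statistics.stdev(diffs)


def simulate(n, ability_mean, rng):
    """Simulated (test, anchor) scores driven by a common ability (hypothetical)."""
    out = []
    for _ in range(n):
        a = rng.gauss(ability_mean, 1.0)
        out.append((30 + 8 * a + rng.gauss(0, 4), 12 + 3 * a + rng.gauss(0, 2)))
    return out


rng = random.Random(42)
p = simulate(300, 0.0, rng)                  # group P took (X, V)
q = simulate(300, 0.3, rng)                  # group Q took (Y, V)
seed_tc = bootstrap_seed(30.0, p, q, tucker, chain_linear)
print(seed_tc)
```

Because the resampling loop never differentiates the equating function, it would in principle also apply where the delta method does not, which is the situation the paragraph above has in mind.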

References

Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service. (Reprinted from Educational measurement, 2nd ed., by R. L. Thorndike, Ed., 1971, Washington, DC: American Council on Education.)

Braun, H. I., & Holland, P. W. (1982). Observed-score test equating: A mathematical analysis of some ETS equating procedures. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 9–49). New York: Academic.

von Davier, A. A. (2001). Testing unconfoundedness in regression models with normally distributed variables. Aachen: Shaker Verlag.

von Davier, A. A. (2003). Notes on linear equating methods for the Non-Equivalent Groups design (ETS RR-03-4). Princeton, NJ: Educational Testing Service.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. New York: Springer Verlag.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2003). Population invariance and chain versus post-stratification methods for equating and test linking. In N. Dorans (Ed.), Population invariance of score linking: Theory and applications to Advanced Placement Program Examinations (ETS RR-03-7). Princeton, NJ: Educational Testing Service.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (in press). The chain and post-stratification methods for observed-score equating: Their relationship to population invariance. Journal of Educational Measurement.

Dorans, N. J. (2003, May 16). Score equity analysis. Paper presented at the Ledyard R. Tucker Psychometric Workshop, Educational Testing Service, Princeton, NJ.

Dorans, N. J., & Holland, P. W. (2000). Population invariance and equitability of tests: Basic theory and the linear case. Journal of Educational Measurement, 37.

Hanson, B. A., Zeng, L., & Kolen, M. J. (1993). Standard errors of Levine linear equating. Applied Psychological Measurement, 17.

Harris, D. J., & Crouse, J. D. (1993). A study of criteria used in equating. Applied Measurement in Education, 6(3).

Holland, P. W., King, B. F., & Thayer, D. T. (1989). The standard error of equating for the kernel method of equating score distributions (ETS PSRTR-89-83, ETS RR-89-06). Princeton, NJ: Educational Testing Service.
