EVALUATION OF MATHEMATICAL MODELS FOR ORDERED POLYCHOTOMOUS RESPONSES. Fumiko Samejima*


In this paper, mathematical modeling is treated as distinct from curve fitting. Considerations of the psychological reality behind our data are emphasized, and criteria such as additivity in a model, its natural generalization to a continuous response model, satisfaction of the unique maximum condition and orderliness of the modal points of the operating characteristics of the ordered polychotomous responses are proposed. Strengths and weaknesses of mathematical models for ordered polychotomous responses that include the normal ogive model, the logistic model, the acceleration model and the family of ordered polychotomous models developed from Bock's nominal model are observed and discussed in terms of such criteria. It is concluded that it will be better to leave Bock's model as a nominal model, as he intended it to be, without expanding it to ordered polychotomous models.

1. Introduction

A widely used approach is for a researcher to apply a certain statistical method, or methods, to several different mathematical models, decide which model fits the set of data in question best, and accept the best-fitting model. Although it has been used as a standard procedure, this mechanical application of a statistical method involves certain serious problems which may lead to wrong decisions. There is a high probability that a model which provides a wide variety of curve shapes will be selected regardless of the principle and assumptions behind the model, which may not agree with the psychological background of our data; the procedure then reduces to simple curve fitting, which is distinct from mathematical modeling. Thus it is important to identify a family of models that can be substantively justified before using such a statistical method.
Key Words and Phrases: latent trait models, item response theory, ordered polychotomous responses, categorical data, mathematical modeling, graded response models, partial credit models, nominal models, continuous response models

* Department of Psychology, University of Tennessee, Knoxville, Tenn , same psychl.psych.utk.edu

Samejima (1972) proposed a general latent trait model for graded, or ordered polychotomous, responses, and in this framework distinguished the homogeneous case and the heterogeneous case. In the same paper, she also pointed out that a family of ordered polychotomous response models can be developed from Bock's nominal model (Bock, 1972), which belongs to the heterogeneous case, with the restriction that one of the two parameters in the model be arranged in ascending order of the polychotomous item score. Samejima (1979) did not pursue graded response models that could have been expanded from Bock's nominal model,

however, because Bock's nominal model is based on choice behavior and the assumption intrinsic in the model does not fit typical ordered polychotomous response situations. Later, however, Masters (1982) and Muraki (1992) proposed the partial credit model and the generalized partial credit model, respectively, for ordered polychotomous responses, which are special cases of Bock's nominal model satisfying Samejima's condition. For brevity, those models for graded responses extended from Bock's nominal model will be called extended Bock models.

Thissen & Steinberg (1986) pointed out that many seemingly disparate models proposed for multiple responses may be considered as generalizations or special cases of each other, and that Samejima's (1969) graded and Bock's (1972) nominal proposals remain the only distinct approaches. In that paper, what Thissen & Steinberg meant by Samejima's graded response model is actually a subset of models in the homogeneous case represented by the normal ogive and logistic models (Samejima, 1972). Thissen & Steinberg's naming, difference models, as opposed to divided-by-total models represented by Bock's nominal model, is still applicable, however, to Samejima's general model of ordered polychotomous responses, which includes the heterogeneous case as well as the homogeneous case. Samejima (1995) proposed the acceleration model, which belongs to the heterogeneous case.

In the present paper, strengths and weaknesses of mathematical models for ordered polychotomous responses that include the normal ogive model, the logistic model, the acceleration model, and the family of ordered polychotomous models developed from Bock's nominal model will be discussed in terms of criteria other than the goodness of fit of the curves, and some conclusions will be reached.

2.
Curve fittings and mathematical modeling

In this paper, the term curve fitting is used for an application of a statistical method, or methods, for evaluating discrepancies between the empirically obtained curves and the ones provided by a parametric model. The reasons why curve fittings should not be over-emphasized in model selection include:

(a) Goodness of fit of the curves to the set of data is a necessary condition for justifying the use of a specific model, but not a sufficient condition.
(b) Two or more mathematical models based on substantially different principles, and thus with parameters of substantially different meanings, may provide almost identical sets of curves regardless of their differences in philosophy.
(c) Poor fit of curves can result not from the inappropriateness of the model, but from deficiencies or limitations in the adopted computer software.

To illustrate (b), Figure 1 presents six operating characteristics following the

acceleration model, and Figure 2 also presents six operating characteristics following Masters' partial credit model, which belongs to the family of extended Bock models to deal with ordered polychotomous responses, and is based on substantially different principles from those behind the acceleration model. It is obvious that the two sets of operating characteristics shown in Figures 1 and 2 are practically identical, regardless of the differences in philosophies behind the two models. This implies that, if one set of curves fits some data set well, then the other set of curves will fit just as well, provided that software for the second model is written as well as that for the first model, a fact that illustrates statements (b) and (a).

Fig. 1 A set of operating characteristics of six steps in the acceleration model, with the parameters $\alpha_{x_g}$, $\beta_{x_g}$ and $\xi_{x_g}$ for $x_g = 1, 2, 3, 4, 5$, respectively [values missing in the source].

Fig. 2 A set of operating characteristics of six discrete responses following Bock's nominal model converted to a graded response model, or Masters' partial credit model, with $\alpha_{x_g} = 1, 2, 3, 4, 5, 6$ and $\beta_{x_g} = 1.0, 2.0, 3.0, 3.5, 1.8, 1.0$ for $x_g = 0, 1, 2, 3, 4, 5$, respectively.

For the reasons described above, differences in goodness of fit provided by

separate mathematical models should not be over-emphasized in model selection; otherwise, it is very possible that we falsely select a model which does not represent the nature of our data. Continued use of such a model in further research will eventually run into problems. It is inconceivable that a single mathematical model be appropriate for all sets of data with a variety of different psychological backgrounds; and yet a wrong model selection procedure may lead to such a conclusion as long as the goodness of fit of the curves is used as the sole criterion for model selection.

In mathematical modeling, therefore, the fit of the principles behind a model, rather than of the curves it provides, to the psychological reality on which our data are based is most important. If the fit is very poor for a model that seems to be appropriate in principle, then there will be room for reconsideration. If the fit of the curves provided by a model to our data is reasonably good, or fair, however, the model should not be discarded; criteria other than goodness of fit should be seriously considered in order to make the right decision in model selection.

3. Principles behind the models

Let $\theta$ be the latent trait, or ability, which represents a construct hypothesized behind certain human behavior, and is assumed to take on any real number. Let $g$ denote an item, which is the smallest manifest entity for measuring $\theta$. Let $X_g$ be a graded item response to item $g$, and $x_g$ $(= 0, 1, \ldots, m_g)$ denote its realization. Note that these non-negative integers are arbitrary, and will never be used directly in ability estimation, unless sequences of item scores, or response patterns, are summarized into the test score and it is used as a substitute for ability, with a loss in the amount of test information (Samejima, 1969) and thus in accuracy of ability estimation. If the reader prefers, therefore, he/she can use, say, A, B, C, etc., instead of 0, 1, 2, etc.
The operating characteristic, $P_{x_g}(\theta)$, of the item score $x_g$ indicates the conditional probability, given $\theta$, with which the individual of ability $\theta$ obtains the item score $x_g$, that is,

$$P_{x_g}(\theta) = \mathrm{prob.}[X_g = x_g \mid \theta].$$

This operating characteristic is assumed to be five times differentiable (Samejima, 1993a, 1993b) with respect to $\theta$. For convenience, hereafter, $x_g$ will be used both for a specific discrete response and for the event $X_g = x_g$, and a similar usage is applied for other symbols.

The fundamental framework of the general graded response model (Samejima, 1972) is given by

$$P_{x_g}(\theta) = \prod_{u \le x_g} M_u(\theta)\,[1 - M_{(x_g+1)}(\theta)], \tag{3.1}$$

where $M_{x_g}(\theta)$, called the processing function (Samejima, 1995) of the step $x_g$ $(= 1, 2, \ldots, m_g)$,

is the joint conditional probability with which the individual clears the step $x_g$, under the conditions that: (a) the individual's ability level is $\theta$, and (b) the steps up to $(x_g - 1)$ have already been cleared. The processing function is assumed to be non-decreasing in $\theta$. Let $(m_g + 1)$ be the hypothesized graded item score adjacent to and above $m_g$. Since, regardless of $\theta$, everyone can at least obtain the item score 0, and no one is able to obtain the item score $(m_g + 1)$, it is reasonable to set

$$M_{x_g}(\theta) = \begin{cases} 1 & \text{for } x_g = 0 \\ 0 & \text{for } x_g = m_g + 1 \end{cases}$$

for all $\theta$. Let $P^*_{x_g}(\theta)$ denote the cumulative operating characteristic (Samejima, 1995), which is the conditional probability with which the individual of ability $\theta$ clears at least the step $x_g$. Thus

$$P^*_{x_g}(\theta) = \mathrm{prob.}[X_g \ge x_g \mid \theta] = \prod_{u \le x_g} M_u(\theta). \tag{3.2}$$

From (3.1) and (3.2) the operating characteristic $P_{x_g}(\theta)$ can be written as

$$P_{x_g}(\theta) = P^*_{x_g}(\theta) - P^*_{(x_g+1)}(\theta). \tag{3.3}$$

Thissen & Steinberg's naming (Thissen & Steinberg, 1986), difference models, comes from the above Eq. (3.3).

It should be noted that the general framework represented by (3.1) is not restricted to sequential processes. Take a Likert type categorical judgment, for example. When we select one of the four response categories, strongly disagree, disagree, agree and strongly agree, to a given statement in social attitude measurement, we usually do not compare our beliefs with each of the consecutive categories starting from the bottom. And yet selection of a specific response category implies such comparisons, and $M_{x_g}(\theta)$'s for those $x_g$'s implicitly exist. It should also be noted that in sequential processes surpassing the step $x_g$ may not be explicit for all individuals. This is exemplified by the fact that some bright individuals seemingly skip lower steps in the sequence and go directly to higher steps.

The general model represented by (3.1), (3.2) and (3.3) leads to two separate cases, that is, the homogeneous case and the heterogeneous case.
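The relationship among (3.1), (3.2) and (3.3) can be sketched numerically. The following is a minimal illustration, not part of the original paper: the function name `operating_characteristics` and the three processing-function values are hypothetical, chosen only to show that the products in (3.2) and the differences in (3.3) reproduce (3.1) and sum to unity at a fixed $\theta$.

```python
def operating_characteristics(processing):
    """Return [P_0, ..., P_m] at one fixed theta.

    `processing` holds the processing-function values M_1, ..., M_m
    evaluated at that theta (hypothetical inputs for illustration).
    """
    # Boundary conventions from the text: M_0 = 1 and M_{m+1} = 0.
    m_values = [1.0] + list(processing) + [0.0]

    # Eq. (3.2): P*_x is the product of M_u over u <= x.
    cumulative = []
    running = 1.0
    for value in m_values:
        running *= value
        cumulative.append(running)

    # Eq. (3.3): P_x = P*_x - P*_{x+1} for x = 0, ..., m.
    return [cumulative[x] - cumulative[x + 1] for x in range(len(processing) + 1)]


# Hypothetical processing-function values for an item with m = 3 steps.
example_m = [0.9, 0.7, 0.4]
example_p = operating_characteristics(example_m)
print(example_p)
```

Each returned value agrees with (3.1) directly; for instance, the second entry equals $M_1(\theta)[1 - M_2(\theta)]$ for the chosen inputs.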
Models in the homogeneous case are characterized by the identical shapes of the cumulative operating characteristics, $P^*_{x_g}(\theta)$'s, for $x_g = 1, 2, \ldots, m_g$; these $m_g$ functions are positioned alongside the abscissa in accordance with the order of the item scores $x_g$. Note that the distances between the two adjacent curves, $P^*_{x_g}(\theta)$ and $P^*_{(x_g+1)}(\theta)$ for $x_g = 1, 2, \ldots, m_g - 1$, may be different for separate pairs. Thus for a model in the homogeneous case $P^*_{x_g}(\theta)$ can be expressed as

$$P^*_{x_g}(\theta) = \int_{-\infty}^{a_g(\theta - b_{x_g})} \phi(u)\,du, \tag{3.4}$$

where $\phi(u)$ is a specified density function, $a_g$ $(> 0)$ is the discrimination parameter of item $g$, which is common to all responses to item $g$, and $b_{x_g}$ is a location parameter for the item score $x_g$ satisfying

$$-\infty = b_0 < b_1 < b_2 < \cdots < b_{m_g} < b_{m_g+1} = \infty. \tag{3.5}$$

Note that from (3.2), (3.4) and (3.5) it is obvious that $M_{x_g}(\theta)$ of any model which belongs to the homogeneous case assumes unity for $x_g = 0$ and zero for $x_g = m_g + 1$ for all $\theta$.

Two examples of this family are the normal ogive and logistic models (Samejima, 1969, 1972), whose cumulative operating characteristics are given by

$$P^*_{x_g}(\theta) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a_g(\theta - b_{x_g})} \exp\left[-\frac{u^2}{2}\right] du \tag{3.6}$$

and

$$P^*_{x_g}(\theta) = \frac{1}{1 + \exp[-D a_g(\theta - b_{x_g})]}, \tag{3.7}$$

respectively, where $b_{x_g}$'s for $x_g = 0, 1, \ldots, m_g, m_g + 1$ satisfy (3.5), and the scaling factor $D$ in (3.7) is usually set equal to 1.7 in order to make these cumulative operating characteristics close enough to those in the normal ogive model (see Birnbaum, 1968). Eqs. (3.6) and (3.7) are special cases of (3.4) where $\phi(u)$ is specified by the standard normal density function and the logistic density function that accommodates $D$, respectively.

It should be noted, however, that the cumulative operating characteristics, $P^*_{x_g}(\theta)$'s, in the homogeneous case do not have to be point-symmetric, that is, the relationship

$$P^*_{x_g}(b_{x_g} + \Delta\theta) = 1 - P^*_{x_g}(b_{x_g} - \Delta\theta),$$

where $\Delta\theta$ is any increment or decrement of $\theta$, does not have to hold, as it does in the normal ogive model and the logistic model. Some asymmetric examples have been shown elsewhere (see Samejima, 1972), and more general observations and discussion concerning asymmetric $P^*_{x_g}(\theta)$'s are made in a separate paper (Samejima, in preparation).

It has been observed (Samejima, 1972) that, in spite of the similarity between the two sets of $m_g$ cumulative operating characteristics in the normal ogive and logistic models, the philosophies behind their processing functions are characteristically different.
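The closeness of (3.6) and (3.7), and the difference between the resulting processing functions, can be examined with a short numerical sketch. This is an illustration under stated assumptions rather than the author's procedure: the function names are hypothetical, the processing function is obtained as the ratio $P^*_{x_g}(\theta)/P^*_{(x_g-1)}(\theta)$ implied by (3.2), and the parameter values $a_g = 1.0$ and $b_{x_g} = -3.5, -3.0, -2.0, 0.0, 3.0$ are the ones the paper uses for its Figures 3 and 4.

```python
from math import erfc, exp, sqrt

D = 1.7  # scaling factor in Eq. (3.7)


def normal_ogive_cum(theta, a, b):
    # Eq. (3.6): standard normal distribution function at a * (theta - b),
    # written with erfc to keep precision in the lower tail.
    return 0.5 * erfc(-a * (theta - b) / sqrt(2.0))


def logistic_cum(theta, a, b):
    # Eq. (3.7).
    return 1.0 / (1.0 + exp(-D * a * (theta - b)))


def processing_function(theta, a, b_prev, b_curr, cum):
    # From Eq. (3.2): M_x = P*_x / P*_{x-1}; used here for x >= 2 (finite b_prev).
    return cum(theta, a, b_curr) / cum(theta, a, b_prev)


a_g = 1.0
b = [-3.5, -3.0, -2.0, 0.0, 3.0]  # b_1, ..., b_5

# Largest gap between the two cumulative curves for x_g = 4 on a theta grid.
grid = [i / 10.0 for i in range(-60, 61)]
max_gap = max(
    abs(normal_ogive_cum(t, a_g, b[3]) - logistic_cum(t, a_g, b[3])) for t in grid
)

# Processing function of step 2 at a very low ability level.
low_theta = -10.0
m2_normal = processing_function(low_theta, a_g, b[0], b[1], normal_ogive_cum)
m2_logistic = processing_function(low_theta, a_g, b[0], b[1], logistic_cum)
print(max_gap, m2_normal, m2_logistic)
```

On this grid the two cumulative curves stay within 0.01 of each other, while the step-2 processing functions at low $\theta$ differ markedly, which is the contrast the text refers to.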
Figures 3 and 4 illustrate processing functions as well as cumulative operating characteristics in the normal ogive and logistic models, respectively, with $m_g = 5$ and the common item parameters $a_g = 1.0$, $b_1 = -3.5$, $b_2 = -3.0$, $b_3 = -2.0$, $b_4 = 0.0$ and $b_5 = 3.0$. A characteristic difference between these two models lies in the lower asymptotes of their processing functions. In the normal ogive model, this

asymptote equals zero for every $x_g$ $(= 1, 2, \ldots, m_g)$, whereas in the logistic model it is given by

$$\lim_{\theta \to -\infty} M_{x_g}(\theta) = \exp[-D a_g(b_{x_g} - b_{x_g-1})],$$

which assumes zero for $x_g = 1$ and a positive value for $x_g = 2, 3, \ldots, m_g$, and this value increases as the distance between the two adjacent difficulty parameters decreases. This difference will be discussed further in a later section.

Fig. 3 A set of processing functions and corresponding cumulative operating characteristics for $x_g = 1, 2, 3, 4, 5$, following the normal ogive model with the parameters $a_g = 1.0$ and $b_{x_g} = -3.5, -3.0, -2.0, 0.0, 3.0$, respectively.

Fig. 4 A set of processing functions and corresponding cumulative operating characteristics for $x_g = 1, 2, 3, 4, 5$, following the logistic model with the scaling factor $D = 1.7$ and the parameters $a_g = 1.0$ and $b_{x_g} = -3.5, -3.0, -2.0, 0.0, 3.0$, respectively.

By the heterogeneous case of the graded response model we mean a family of models in which not all $P^*_{x_g}(\theta)$'s for $x_g = 1, 2, \ldots, m_g$ are identical in shape. Note that (3.3) applies to models in the heterogeneous case, as well as those in the homogeneous case.

The acceleration model (Samejima, 1995) is an example of this family of models. In this model $P_{x_g}(\theta)$ is given by

$$P_{x_g}(\theta) = \prod_{u \le x_g} [\Psi_u(\theta)]^{\xi_u}\,\bigl[1 - [\Psi_{(x_g+1)}(\theta)]^{\xi_{x_g+1}}\bigr], \tag{3.8}$$

where $\xi_{x_g}$ $(> 0)$ is the step acceleration parameter, and $\Psi_{x_g}(\theta)$ in (3.8) may be specified by

$$\Psi_{x_g}(\theta) = \frac{1}{1 + \exp[-D\alpha_{x_g}(\theta - \beta_{x_g})]}, \tag{3.9}$$

where $D = 1.7$, and $\alpha_{x_g}$ $(> 0)$ and $\beta_{x_g}$ are the discrimination and location parameters, respectively. When (3.9) is adopted for $\Psi_{x_g}(\theta)$, the processing function becomes

$$M_{x_g}(\theta) = \left[\frac{1}{1 + \exp[-D\alpha_{x_g}(\theta - \beta_{x_g})]}\right]^{\xi_{x_g}}. \tag{3.10}$$

In the example of the acceleration model illustrated in Figure 1, (3.9) was used for $\Psi_{x_g}(\theta)$ and the parameter values of $\alpha_{x_g}$, $\beta_{x_g}$ and $\xi_{x_g}$ [values missing in the source] are given for $x_g = 0, 1, 2, 3, 4$ and 5, respectively.

The acceleration model has been proposed, basically, as a model for sequential cognitive processes, such as those in problem solving. In this specific application, it is assumed that there is more than one observable step in the entire cognitive process. Graded item scores, or partial credits, 1 through $m_g$, are assigned to the successful completions of these separate observable steps. It is also assumed that, within each step, there are $s_{x_g}$ $(\ge 1)$ subprocesses $w_{x_g 1}, w_{x_g 2}, \ldots, w_{x_g s_{x_g}}$, which may or may not be observable. These subprocesses within a step contribute to successful completion of the step through their own subprocess acceleration parameters $\xi_{w_{x_g i}} \ge 0$ $(i = 1, 2, \ldots, s_{x_g})$, through

$$\xi_{x_g} = \sum_{i=1}^{s_{x_g}} \xi_{w_{x_g i}}. \tag{3.11}$$

It is obvious from (3.11) that the processing function $M_{x_g}(\theta)$ can be expressed as the product of the $s_{x_g}$ subprocess processing functions within the step $x_g$. Let $M_{(w_{x_g h})}(\theta)$ be the incomplete step processing function after the $h$-th subprocess within the step $x_g$ has been cleared, where $1 \le h \le s_{x_g}$. Thus

$$M_{(w_{x_g h})}(\theta) = \left[\frac{1}{1 + \exp[-D\alpha_{x_g}(\theta - \beta_{x_g})]}\right]^{\sum_{i=1}^{h} \xi_{w_{x_g i}}}. \tag{3.12}$$

From (3.12), the first and second partial derivatives of $M_{(w_{x_g h})}(\theta)$ with respect to $\theta$ can be written as

$$\frac{\partial}{\partial\theta} M_{(w_{x_g h})}(\theta) = \sum_{i=1}^{h} \xi_{w_{x_g i}}\, D\alpha_{x_g}\, [\Psi_{x_g}(\theta)]^{\sum_{i=1}^{h} \xi_{w_{x_g i}}}\, [1 - \Psi_{x_g}(\theta)] > 0$$

and

$$\frac{\partial^2}{\partial\theta^2} M_{(w_{x_g h})}(\theta) = \sum_{i=1}^{h} \xi_{w_{x_g i}}\, D^2\alpha_{x_g}^2\, [\Psi_{x_g}(\theta)]^{\sum_{i=1}^{h} \xi_{w_{x_g i}}}\, [1 - \Psi_{x_g}(\theta)] \left[\sum_{i=1}^{h} \xi_{w_{x_g i}} \{1 - \Psi_{x_g}(\theta)\} - \Psi_{x_g}(\theta)\right], \tag{3.13}$$

respectively, where $\Psi_{x_g}(\theta)$ is given by (3.9). Setting (3.13) equal to zero, $\theta_{hdmax}$, the value of $\theta$ at which $M_{(w_{x_g h})}(\theta)$ is most discriminating, is obtained by

$$\theta_{hdmax} = \Psi_{x_g}^{-1}\left[\frac{\sum_{i=1}^{h} \xi_{w_{x_g i}}}{1 + \sum_{i=1}^{h} \xi_{w_{x_g i}}}\right]. \tag{3.14}$$

It is obvious from (3.9) and (3.14) that $\theta_{hdmax}$ increases as a greater number of subprocesses within the step have been cleared. Figure 5 illustrates the change of $\theta_{hdmax}$ for five hypothetical subprocesses within a step, for which $\alpha_{x_g} = 1.0$ and $\beta_{x_g} = 0.0$ in (3.9) and the $\xi_{w_{x_g i}}$'s are 0.25, 0.25, 0.25, 0.25 and 0.50 for $i = 1, 2, 3, 4, 5$, respectively. The values of $\theta_{hdmax}$ for $h = 1, 2, 3, 4, 5$ were obtained from (3.14) and turned out to be $-0.815$, $-0.408$, $-0.169$, $0.000$ and $0.239$, respectively, and the corresponding values of $M_{(w_{x_g h})}(\theta)$ at $\theta = \theta_{hdmax}$ are 0.669, 0.577, 0.530, 0.500 and 0.465.

Fig. 5 Change of $\theta_{hdmax}$ for five hypothetical subprocesses within a step, with the step discrimination and location parameters $\alpha_{x_g} = 1.0$ and $\beta_{x_g} = 0.0$ and the subprocess acceleration parameters $\xi_{w_{x_g i}} = 0.25, 0.25, 0.25, 0.25, 0.50$ for $i = 1, 2, 3, 4, 5$, respectively.

Thus if $\xi_{w_{x_g i}}$ assumes a large value, then the contribution of $w_{x_g i}$ to the successful completion of the step $x_g$ will be large in the sense that it accelerates the sum of the subprocess parameters toward $\xi_{x_g}$ and the value of $\theta_{hdmax}$; if it is small, then its contribution will also be small. Note that $\xi_{w_{x_g i}}$ can be zero, in which case the subprocess contributes nothing to accelerating the sum of the subprocess parameters or $\theta_{hdmax}$. To give an example, in proving the cosine law, one step is to use Pythagoras' Theorem. In this step, a subprocess of drawing a perpendicular line to a side from the opposite angle is included. It is considered that anyone who thinks of using Pythagoras' Theorem can draw such a line. This implies that $\xi_{w_{x_g i}}$ for this subprocess $i$ is zero, that is, no contribution of the subprocess $w_{x_g i}$ to the step acceleration parameter $\xi_{x_g}$ is

provided.

If two or more steps, or sequences of steps, are reversible in order, the steps are said to be parallel, as distinct from serial steps. The same logic applies to subprocesses within a step, that is, parallel subprocesses are those which are reversible in order and serial subprocesses are those which are irreversible in order within the step. It is assumed that for any number of parallel subprocesses the subprocess acceleration parameters are unchanged by reversals of their sequential orders, so that the step acceleration parameter given by (3.11) is unaffected.

In Bock's nominal model (Bock, 1972) the operating characteristic is given by

$$P_{k_g}(\theta) = \frac{\exp[\alpha_{k_g}\theta + \beta_{k_g}]}{\sum_{u \in K_g} \exp[\alpha_u\theta + \beta_u]}, \tag{3.15}$$

where $k_g$ denotes a nominal response to item $g$, $K_g$ is a specific subset of responses selected from the total answer space, as exemplified by the set of the correct answer and several distractors in the multiple-choice test item, and $\alpha_{k_g}$ $(> 0)$ and $\beta_{k_g}$ are item response parameters. It is obvious from (3.15) that the operating characteristic $P_{k_g}(\theta)$ in Bock's nominal model depends on the specific subset of responses to item $g$, for the denominator of (3.15) is the sum total of the numerators of the operating characteristics of all $k_g \in K_g$, whereas the numerator stays unchanged regardless of the choice of the subset. Thissen and Steinberg (1986) called this family divided-by-total models, the naming stemming from (3.15).

It is obvious from (3.15) that invariance exists in the conditional ratio of the operating characteristics, given $\theta$, of any pair of responses $k_g$ and $h_g$, regardless of the choice of the subset $K_g$ from the answer space to which $k_g$ and $h_g$ belong. This conditional ratio is given by

$$\frac{P_{k_g}(\theta)}{P_{h_g}(\theta)} = \frac{\exp[\alpha_{k_g}\theta + \beta_{k_g}]}{\exp[\alpha_{h_g}\theta + \beta_{h_g}]} = \exp[(\alpha_{k_g} - \alpha_{h_g})\theta]\,\exp[\beta_{k_g} - \beta_{h_g}], \tag{3.16}$$

which solely depends on the parameters of the two responses $k_g$ and $h_g$, a principle similar to the one behind individual choice behavior (Luce, 1959).
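The invariance expressed by (3.16) can be checked with a small numerical sketch. The response labels, parameter values and the function name `nominal_probabilities` below are hypothetical and serve only to illustrate (3.15): the operating characteristics themselves change when the subset $K_g$ changes, but the ratio for a given pair of responses does not.

```python
from math import exp


def nominal_probabilities(theta, params):
    """Eq. (3.15): params maps each response in K_g to its (alpha, beta)."""
    weights = {k: exp(alpha * theta + beta) for k, (alpha, beta) in params.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical (alpha, beta) values for four nominal responses.
full_set = {"k1": (0.5, 0.2), "k2": (1.0, -0.3), "k3": (1.5, 0.1), "k4": (2.0, -0.5)}
# A smaller subset K_g containing only two of the responses.
subset = {key: full_set[key] for key in ("k1", "k2")}

theta = 0.8
p_full = nominal_probabilities(theta, full_set)
p_sub = nominal_probabilities(theta, subset)

ratio_full = p_full["k1"] / p_full["k2"]
ratio_sub = p_sub["k1"] / p_sub["k2"]
# Right-hand side of Eq. (3.16) for the pair (k1, k2).
ratio_formula = exp((0.5 - 1.0) * theta) * exp(0.2 - (-0.3))
print(ratio_full, ratio_sub, ratio_formula)
```

The individual probabilities in `p_full` and `p_sub` differ, which reflects the dependence of (3.15) on the chosen subset, while the three ratios coincide.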
It has been shown (Samejima, 1972) that the model can also be considered as a graded response model in the heterogeneous case if $k_g$ is replaced by $x_g$ in (3.15) and the parameter $\alpha_{x_g}$ satisfies

$$\alpha_0 \le \alpha_1 \le \alpha_2 \le \cdots \le \alpha_{m_g}, \tag{3.17}$$

where a strict inequality should hold at least at one place. If this condition is satisfied, then the processing function is given by

$$M_{x_g}(\theta) = \frac{\sum_{u=x_g}^{m_g} \exp[\alpha_u\theta + \beta_u]}{\sum_{u=x_g-1}^{m_g} \exp[\alpha_u\theta + \beta_u]}. \tag{3.18}$$

Samejima (1979) found it difficult to extend Bock's nominal model to an ordered polychotomous model, however. A big difference between Samejima's general graded response model represented by (3.1) and Bock's nominal model represented

by (3.15), or between difference models and divided-by-total models, is that the borderlines or thresholds of adjacent item response categories are parameterized in the former whereas item responses themselves are parameterized in the latter. The implicit invariance assumption represented by (3.16) is acceptable only when $k_g$ and $h_g$ are solid discrete entities that can neither be more finely classified nor combined with another response or responses, although finer classification and combination are required in many typical ordered polychotomous response situations.

Later, however, Masters (1982) proposed his partial credit model and Muraki (1992) proposed his generalized partial credit model, both of which are special cases of Bock's nominal model that are converted to ordered polychotomous models satisfying (3.17) with strict inequalities at all places. In Masters' partial credit model, $\alpha_{x_g}$ is given by

$$\alpha_{x_g} = x_g + 1 \quad \text{for } x_g = 0, 1, \ldots, m_g. \tag{3.19}$$

In the example illustrated in Figure 2, the values of $\beta_{x_g}$ are 1.0, 2.0, 3.0, 3.5, 1.8 and 1.0 for $x_g = 0, 1, 2, 3, 4, 5$, respectively. In Muraki's generalized partial credit model, the operating characteristic has an additional discrimination parameter $a_g$ $(> 0)$, and $\alpha_{x_g} = (x_g + 1)a_g$ for $x_g = 0, 1, \ldots, m_g$.

4. Typical ordered polychotomous responses

Typical ordered polychotomous responses are identified in: (1) categorical judgment, which was exemplified earlier, (2) rating scales, exemplified by letter grading (e.g., A, B, C, D and F) of academic performance, (3) partial credit given in accordance with an individual's level of closeness to a specific goal, which is exemplified by a cognitive process like problem solving, etc. Each situation has somewhat different characteristics, and selection of a model, or models, in each case should be made with its specific psychological background in mind.

It can be seen, however, that there are certain characteristics which are common among the above typical ordered polychotomous response situations.
First of all:

* Those ordered polychotomous categories are more or less arbitrary.

To give a couple of examples, for a required college course the letter grades, A, B, C, D and F, may be changed to Pass and Fail, setting the borderline between, say, B and C; also, for a statement in social attitude measurement, a dichotomous response format, a 5-point scale format, a 7-point scale format, etc., and even a continuous response format are used. Another example can be seen in cognitive assessment. With the advancement of computer technologies, it is quite possible to obtain more abundant information from the individual's performance in computerized experiments with constructed responses as research proceeds, which will result in an increase in the number of ordered categories.

The above examples indicate that there are two directions, that is,

1. finer recategorizations of responses, and
2. combinations of two or more adjacent categories.

It is noted that arbitrariness of ordered polychotomous response categories includes two different situations with respect to the thresholds between response categories.

(a) Fixed threshold situation. The example of redichotomizing the letter grades A, B, C, D and F into Pass and Fail belongs to this situation, and so does the case of reducing the number of 5-point response categories to 2 categories in data analysis. The example of more precise observation of a cognitive process cited earlier also belongs to this situation.
(b) Flexible threshold situation. An example can be seen when B+ is added to the set of letter grades A, B, C, D and F and the protocols are regraded using the resulting 6 categories. It is likely, for example, that the threshold between B and C will be shifted in the negative direction. Similar shifts will occur in attitude measurement if the set of 5 response categories, strongly disagree, disagree, neutral, agree and strongly agree, is changed to a set of 4 response categories by deleting neutral, and data are collected again.

Additivity intrinsic in a model is defined as the characteristic of the model which provides operating characteristics that belong to the same model, that is, the mathematical form of the resulting operating characteristic(s) is the same as that of the original operating characteristic(s) for both more finely categorized responses and combined category responses, in either of the two threshold situations. If additivity does not hold for a model, then the operating characteristics provided by the model will be heavily affected by incidental factors such as the number of response categories adopted in the protocols, etc.
Thus additivity is an important feature of models that can be adopted for typical ordered polychotomous responses, and will be discussed further in the subsequent section along with several other features.

5. Criteria for model evaluation

The first criterion in evaluating a model should be to find out whether the principle behind the model and the set of accompanying assumptions agree with the psychological processes presumed to underlie the data. Without satisfying this criterion, mathematical modeling cannot be realized, and the research could end up with mere curve fitting, producing meaningless item parameters.

From the observations made in the preceding section, in typical ordered polychotomous situations, additivity intrinsic in a model is required, and this will legitimately be adopted as the second criterion for evaluating ordered polychotomous response models.

As the number of ordered polychotomous categories increases, the situation approaches a continuous response case as the limiting situation. It is desirable that such a natural generalization to a continuous response model holds, and this will be the third criterion, which is a natural extension of additivity.

In estimating the individual's ability level from his/her response pattern, it is desirable that its likelihood function has a unique modal point; otherwise, multiple maximum likelihood estimates of his/her ability will result. Samejima (1969, 1972) proposed a sufficient condition for a unique maximum such that

$$\frac{\partial}{\partial\theta} A_{x_g}(\theta) < 0 \tag{5.1}$$

is satisfied for all $\theta$ except, possibly, for an enumerable number of points, and

$$\lim_{\theta \to -\infty} A_{x_g}(\theta) \ge 0 \tag{5.2}$$

and

$$\lim_{\theta \to \infty} A_{x_g}(\theta) \le 0, \tag{5.3}$$

where $A_{x_g}(\theta)$ is called the basic function and given by

$$A_{x_g}(\theta) = \frac{\partial}{\partial\theta} \log P_{x_g}(\theta).$$

For brevity, the above set of joint conditions is called the unique maximum condition. It can be seen from (5.1) through (5.3) that, if a model satisfies the unique maximum condition, then a single local or terminal maximum is provided to the operating characteristic of the item score. It will also assure that the likelihood function of any response pattern consisting of such responses has one and only one local or terminal maximum. Thus satisfaction of the unique maximum condition will be the fourth legitimate criterion for evaluating ordered polychotomous models. The inequality given by (5.1) can be replaced by

$$I_{x_g}(\theta) = -\frac{\partial^2}{\partial\theta^2} \log P_{x_g}(\theta) > 0,$$

where $I_{x_g}(\theta)$ is called the item response information function (Samejima, 1972, 1973b).
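As a rough numerical illustration of the unique maximum condition, the basic function of the logistic model (3.7) can be approximated by a central difference and inspected on a grid of $\theta$ values. This is only a sketch under stated assumptions: the thresholds, the grid, the step size `h` and all function names are hypothetical, and a finite grid check of this kind illustrates (5.1) through (5.3) without proving them.

```python
from math import exp, log

D = 1.7  # scaling factor in Eq. (3.7)


def cum(theta, a, b):
    # Logistic cumulative operating characteristic, Eq. (3.7).
    return 1.0 / (1.0 + exp(-D * a * (theta - b)))


def oc(theta, a, bs, x):
    # Eq. (3.3): P_x = P*_x - P*_{x+1}, with P*_0 = 1 and P*_{m+1} = 0.
    star_x = 1.0 if x == 0 else cum(theta, a, bs[x - 1])
    star_next = 0.0 if x == len(bs) else cum(theta, a, bs[x])
    return star_x - star_next


def basic_function(theta, a, bs, x, h=1e-5):
    # Central-difference approximation of A_x(theta) = d/dtheta log P_x(theta).
    return (log(oc(theta + h, a, bs, x)) - log(oc(theta - h, a, bs, x))) / (2.0 * h)


def is_strictly_decreasing(values):
    return all(later < earlier for earlier, later in zip(values, values[1:]))


a_g = 1.0
bs = [-1.0, 0.0, 1.5]  # hypothetical thresholds b_1 < b_2 < b_3
grid = [i / 4.0 for i in range(-12, 13)]  # theta from -3 to 3

# One flag per item score x = 0, ..., m: is A_x decreasing over the grid?
checks = [
    is_strictly_decreasing([basic_function(t, a_g, bs, x) for t in grid])
    for x in range(len(bs) + 1)
]
print(checks)
```

For an intermediate item score the approximated basic function is positive at the low end of the grid and negative at the high end, in line with (5.2) and (5.3).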
In addition to this criterion, it will be desirable from the meaning of the item score that within a single item the model provide ordered modal points of the operating characteristics in the ascending order of the item score. Note that these operating characteristics equal the likelihood functions when a test consists of a single item, and this orderliness leads to the orderliness of the maximum likelihood estimate of

ability. Thus it will be the legitimate fifth criterion for model evaluation.

6. Model selection

In categorical judgment, since each item is expected to have a certain solid relationship with the latent trait $\theta$, it will be appropriate to assume that there is some invariance in the relationship between item $g$ and the latent trait $\theta$ regardless of the categorical thresholds. This can be realized by a common discrimination parameter for all item responses, which provides the same discrimination power when the item is dichotomously rescored by selecting one of the thresholds of two adjacent response categories. If the discrimination power of the item changes substantially by the selection of different thresholds for redichotomization even though the wordings for separate categories are appropriately made, it will be doubted that the item has a solid relationship with the underlying latent trait. The same logic applies to many rating scales also. A model in the homogeneous case represented by (3.4), therefore, will be a decent choice.

The different characteristics of the processing functions in the normal ogive model and the logistic model, which were discussed earlier, can be interpreted as the difference in relative emphases on the two joint conditions for the processing function. In certain situations it may be reasonable to assume that the processing function is close to zero at very low levels of $\theta$ regardless of $x_g$ and of the fact that the steps up to $(x_g - 1)$ have been cleared; the normal ogive model will be more appropriate than the logistic model in these cases.
In certain other situations, however, it may be more reasonable to assume that the fact that the individual has cleared up to the step $(x_g - 1)$ entitles the processing function to be positive no matter how low the individual's ability level may be, and that this lower asymptote is higher if the difficulty levels of the current step and of the preceding one are closer; the logistic model will be more appropriate in such cases.

Distinct from the situations exemplified above, in many other situations including cognitive processes such as problem solving, the homogeneity restriction on the shapes of the $P^*_{x_g}(\theta)$'s may not agree with the psychological reality; it may be more logical to assume heterogeneous relationships to $\theta$ of the separate steps which lead to the problem solution. Thus a model in the heterogeneous case will be a more legitimate choice.

Out of the five criteria for model evaluation, while the fit of the principles behind each model to the psychological background of the data is specific, the other four criteria, that is, additivity intrinsic in a model, generalizability to a continuous response model, satisfaction of the unique maximum condition and orderliness of the modal points of the operating characteristics, are more general in the sense that they are appropriate in most typical ordered polychotomous response situations. These four criteria will be discussed, therefore, for models in the homogeneous case, the acceleration model, and extended Bock models.

6.1 Additivity

A strength of any model in the homogeneous case is that additivity always holds. Assuming that the item in question has a solid and straightforward relationship with the latent trait $\theta$, if $q\ (\geq 1)$ new ordered polychotomous response categories are added between $x_g$ and $(x_g + 1)$, then the identical shapes of the $P^*_{x_g}(\theta)$'s will be preserved, with or without shifts of the locations of the curves for the preexisting item scores along the abscissa, and the $P^*_{x_g}(\theta)$'s for the $q$ new graded response categories will also have identical shapes, with their location parameters found between $b_{x_g}$ and $b_{x_g+1}$. In the flexible threshold situation, for example, if B+ is added to the preexisting letter grades A, B, C, D and F, then the location parameters for these 5 letter grades will be more or less affected: those for B and A will be substantially lowered and elevated, respectively, and the location parameter will be less and less affected as the letter grade departs from B. The location parameter for the new item score B+ will be found between those of B and A.

In the fixed threshold situation, if $r$ adjacent graded categories from $x_g$ to $x_g + r - 1$, for $r \leq m_g - x_g + 1$, are combined, from (3.3) and (3.4) the operating characteristic $P_{x_g}(\theta)$ will be changed to

$$P_{x_g}(\theta) = P^*_{x_g}(\theta) - P^*_{(x_g+r)}(\theta) = \int_{-\infty}^{a_g(\theta - b_{x_g})} \phi(u)\,du - \int_{-\infty}^{a_g(\theta - b_{x_g+r})} \phi(u)\,du ,$$

which obviously belongs to the same, original model, and any other operating characteristic is preserved, with the item score shifted by $(-r+1)$ from its original value when it is greater than $x_g$. Thus if we redichotomize the letter grades A, B, C, D and F into Pass and Fail, for example, setting the borderline between, say, B and C, the resulting operating characteristics of Pass and Fail still belong to the same model. This does not hold for all models in the heterogeneous case, however.
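The Pass and Fail example can be reproduced in a few lines. This is a minimal sketch under stated assumptions: it uses the logistic model of the homogeneous case with the conventional scaling constant $D = 1.7$, and the discrimination parameter, the four difficulty parameters and the function names are hypothetical values chosen only for illustration.

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption of this sketch)

def cumulative(theta, a, b):
    # P*_x(theta) for the boundary with difficulty b (logistic model, homogeneous case).
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def category_probs(theta, a, thresholds):
    # Operating characteristics P_x(theta) = P*_x(theta) - P*_{x+1}(theta),
    # with P*_0 = 1 and P*_{m+1} = 0; thresholds are given in ascending order.
    stars = [1.0] + [cumulative(theta, a, b) for b in thresholds] + [0.0]
    return [stars[x] - stars[x + 1] for x in range(len(stars) - 1)]

a = 1.2                               # hypothetical common discrimination
thresholds = [-1.5, -0.5, 0.4, 1.3]   # boundaries F|D, D|C, C|B, B|A
pass_fail = [0.4]                     # only the C|B borderline is kept

max_gap = 0.0
for i in range(-40, 41):
    theta = i / 10.0
    full = category_probs(theta, a, thresholds)
    merged = category_probs(theta, a, pass_fail)
    gap_fail = abs(sum(full[:3]) - merged[0])   # F + D + C versus Fail
    gap_pass = abs(sum(full[3:]) - merged[1])   # B + A versus Pass
    max_gap = max(max_gap, gap_fail, gap_pass)

print(max_gap)
```

The combined categories coincide, up to rounding error, with a two-category model that keeps the same $a_g$ and the single retained threshold, which is what membership in the same, original model means here.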
In general, because of their more complicated mathematical forms stemming from heterogeneity, it becomes more difficult to develop a model which satisfies additivity.

In the acceleration model, since a subprocess $w_{x_g i}$ affects the operating characteristic $P_{x_g}(\theta)$ solely through $\xi_{w_{x_g i}}$ in the way shown in (3.11), it is obvious that the addition of $q\ (\geq 1)$ ordered polychotomous categories between $x_g$ and $(x_g + 1)$ will provide $q$ operating characteristics that belong to the same, original model; that is, the first feature of additivity is fulfilled. When $r\ (\geq 2)$ adjacent item scores are combined into one step, the second feature of additivity holds only if the $\Psi_{x_g}(\theta)$'s for these $r$ steps have identical $\alpha_{x_g}$'s and $\beta_{x_g}$'s. Robustness of the acceleration model has been observed (Samejima, 1995), however, in the latter situation. This means that, in practical situations, a set of $\alpha_{x_g}$ and $\beta_{x_g}$ in (3.9) is likely to be found for the combined step which provides a $P_{x_g}(\theta)$ almost identical to the sum of the original $r$ operating characteristics. An example has been shown (Samejima, 1995), in which the parameters $\alpha_{x_g}$, $\beta_{x_g}$ and $\xi_{x_g}$ were estimated as the solutions of

$$\frac{\log\left[(p_2)^{-1/\xi_{x_g}} - 1\right] - \log\left[(p_3)^{-1/\xi_{x_g}} - 1\right]}{\log\left[(p_1)^{-1/\xi_{x_g}} - 1\right] - \log\left[(p_2)^{-1/\xi_{x_g}} - 1\right]} = \frac{\theta_3 - \theta_2}{\theta_2 - \theta_1} , \tag{6.1.1}$$

$$M_{x_g}(\beta_{x_g}) = \left[\frac{1}{2}\right]^{\xi_{x_g}} \tag{6.1.2}$$

and

$$\alpha_{x_g} = \frac{2^{\xi_{x_g}+1}}{D\,\xi_{x_g}}\,\frac{\partial}{\partial\theta} M_{x_g}(\theta) \quad \text{at } \theta = \beta_{x_g} , \tag{6.1.3}$$

where $M_{x_g}(\theta)$ is the product of the processing functions, the $M_{x_g}(\theta)$'s, with the $x_g$'s indicating all adjacent item score categories to be combined; $p_1$, $p_2$ and $p_3$ are three arbitrarily selected distinct probabilities arranged in ascending order; and $\theta_1$, $\theta_2$ and $\theta_3$ are the values of $\theta$ at which $M_{x_g}(\theta)$ equals $p_1$, $p_2$ and $p_3$, respectively. The second feature of additivity thus practically holds in the acceleration model. The solutions of (6.1.1), (6.1.2) and (6.1.3) were also adopted in estimating the parameters in the acceleration model and used in Figure 1, with the processing functions in Masters' partial credit model as the $M_{x_g}(\theta)$'s, which are obtained by (3.18) and (3.19).

In divided-by-total models represented by extended Bock models, the operating characteristic resulting from combining $r$ adjacent graded categories becomes

$$P_{x_g}(\theta) = \frac{\sum_{u \in x_g} \exp[\alpha_u \theta + \beta_u]}{\sum_{v} \exp[\alpha_v \theta + \beta_v]} ,$$

where the sum in the numerator is taken over the $r$ combined categories and the sum in the denominator over all categories. This does not belong to the original model. It is also impossible to divide a response into more finely categorized responses and preserve (3.15) for the resulting finer responses. Thus additivity will not hold in either direction.

Lack of additivity will cause serious problems if Masters' partial credit model or Muraki's generalized partial credit model is used for typical ordered response situations. Usefulness of these models is limited, therefore, to situations in which all response categories have certain absolute meanings, with no room for arbitrariness.

6.2 Natural generalization to a continuous response model

For any model which belongs to the homogeneous case, from (3.3) and (3.4) the operating density characteristic, $H_{z_g}(\theta)$, for the continuous response $z_g$ will be obtained by

$$H_{z_g}(\theta) = \lim_{\Delta z_g \to 0} \frac{P^*_{z_g}(\theta) - P^*_{(z_g + \Delta z_g)}(\theta)}{\Delta z_g} = a_g\,\phi\big(a_g(\theta - b_{z_g})\big)\left[\frac{d\,b_{z_g}}{d\,z_g}\right] ,$$

where $b_{z_g}$ is the difficulty parameter for the continuous response $z_g$. Unlike $b_{x_g}$ in ordered polychotomous models, $b_{z_g}$ is a continuous, strictly increasing function of $z_g$. Examples of such models can be seen in the normal ogive and logistic models (Samejima, 1973a).

This characteristic is not shared by all models in the heterogeneous case. The acceleration model can be generalized to a continuous model, however. This can be seen by treating each subprocess in (3.11) as a step, and considering continuous subprocesses within each original step as the limiting situation in which $s_{x_g}$ tends to positive infinity and $\xi_{w_{x_g i}}$ approaches zero. In contrast, such a limiting situation cannot be considered for any extended Bock model.

6.3 Satisfaction of the unique maximum condition

In the homogeneous case, it has been shown (Samejima, 1972, 1973b) that many models, including the normal ogive and logistic models, satisfy the unique maximum condition, although it is not satisfied by the three-parameter normal ogive or logistic model. Note that, because additivity and generalizability to a continuous response model hold for any model in the homogeneous case, as was observed and discussed earlier, satisfaction of the unique maximum condition is carried over to any more finely classified response categories and any combined response categories, and also to continuous responses in its generalized continuous response model (see Samejima, 1972, 1973a).

In the heterogeneous case, it has been shown (Samejima, 1995) that the unique maximum condition is satisfied for the acceleration model when (3.10) is used for $M_{x_g}(\theta)$. Note that this satisfaction is also carried over to more finely categorized steps and to continuous responses in the generalized continuous response model.

It has been shown (Samejima, 1972) that this condition is satisfied for Bock's nominal model.
This implies that in all extended Bock models, including Masters' partial credit model and Muraki's generalized partial credit model, the unique maximum condition is also satisfied.

6.4 Orderliness of the modal points of the operating characteristics

It has been shown (Samejima, 1972) that, in the homogeneous case, those models which satisfy the unique maximum condition, such as the normal ogive model and the logistic model, provide a strict orderliness among the modal points of the $P_{x_g}(\theta)$'s in accordance with the item score $x_g$.

In the acceleration model, it has been shown (Samejima, 1995) that ordered modal points of the operating characteristics exist in usual situations. There are exceptions, however, in which the unidimensionality of the latent space is questionable, and an example is shown in the same paper. In usual situations, the modal points are found in ascending order of the item score $x_g$, as illustrated in Figure 1.

In Bock's nominal model, it has been shown (Samejima, 1972) that ordered modal points of the operating characteristics exist if strict inequalities hold in all relationships of (3.17). This implies that in extended Bock models, including both Masters' partial credit model and Muraki's generalized partial credit model, the modal points are always strictly ordered.

Comparison of these models with respect to the criteria that are legitimate in typical ordered polychotomous situations can be summarized as shown in Table 1.

Table 1. Summary of the principles behind the normal ogive and logistic models, the acceleration model and extended Bock models.

7. Discussion and conclusions

Several criteria for model evaluation have been proposed, and strengths and weaknesses of mathematical models in the homogeneous case, and of the acceleration model and extended Bock models in the heterogeneous case, have been observed and discussed. It has been pointed out that the principles behind models in the homogeneous case fit the psychological realities of certain typical graded response situations such as categorical judgment and rating scales.

As was pointed out earlier, additivity and generalizability to a continuous model tend to be more difficult to satisfy for models in the heterogeneous case. Those models are in demand, however, because of the less restrictive nature of their $P^*_{x_g}(\theta)$'s.

The principles behind extended Bock models do not fit typical ordered polychotomous response situations, however. Bock proposed (3.15) as a nominal model, and wisely applied his model to multiple-choice test items, having the results disclose implicit orders among the distractors of each item. Discovery of implicit orders behind nominal responses is a big strength of Bock's model. Information coming from the distractors can be used for improving the multiple-choice test item, for example.
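How a nominal model can disclose an implicit order may be illustrated with a small numerical sketch. It is not taken from the paper: it assumes that the nominal model has the multinomial logit form in which the probability of response $u$ is proportional to $\exp[\alpha_u \theta + \beta_u]$, and the five sets of distractor parameters, the grid and the function names are hypothetical. The modal point of each response curve is located on the grid and the responses are then ranked by those modal points.

```python
import math

def nominal_probs(theta, alphas, betas):
    # Assumed nominal model: P_u(theta) proportional to exp(alpha_u * theta + beta_u).
    z = [al * theta + be for al, be in zip(alphas, betas)]
    z_max = max(z)                       # subtract the maximum for numerical stability
    w = [math.exp(v - z_max) for v in z]
    total = sum(w)
    return [v / total for v in w]

# Hypothetical response parameters, deliberately listed out of order.
alphas = [0.3, -1.0, 1.4, -0.2, 0.8]
betas = [0.5, -0.3, -1.0, 0.6, 0.1]

grid = [i / 100.0 for i in range(-1500, 1501)]
curves = [nominal_probs(t, alphas, betas) for t in grid]

# Grid location at which each response is most probable.
modal_points = []
for u in range(len(alphas)):
    best = max(range(len(grid)), key=lambda i: curves[i][u])
    modal_points.append(grid[best])

# Rank the responses by modal point and, separately, by slope parameter.
order_by_mode = sorted(range(len(alphas)), key=lambda u: modal_points[u])
order_by_alpha = sorted(range(len(alphas)), key=lambda u: alphas[u])

print(modal_points)
print(order_by_mode, order_by_alpha)
```

In this sketch the two rankings coincide. The curves with the smallest and the largest slope are monotone, so their modal points fall at the two ends of the grid, which stand in for negative and positive infinity.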
Satisfaction of the unique maximum condition and orderliness of the modal points of the operating characteristics intrinsic in the model are also big strengths, for information coming from each and every nominal response can effectively be used in ability estimation. It may be wise, therefore, to let Bock's model stay as a nominal response model, as Bock himself intended when he proposed it, without expanding it to ordered polychotomous models.

It is obvious that the last two criteria in model evaluation are invariant under strictly monotone transformations of $\theta$. That is to say, if a model clears these criteria with $\theta$ as the ability scale, the same will be true with the transformed ability scale $\tau(\theta)$, as long as the transformation is strictly monotone. Homogeneity in a model is not invariant across strictly monotone transformations, however. Thus models in the homogeneous case must be adopted with a carefully selected, meaningful ability scale.

REFERENCES

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. Contributed chapters in Lord, F.M. and Novick, M.R., Statistical theories of mental test scores, Chapters 17-20. Reading, MA: Addison Wesley.

Bock, R.D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37.

Luce, R.D. (1959). Individual choice behavior. New York: Wiley.

Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47.

Muraki, E. (1992). A generalized partial credit model: application of an EM algorithm. Applied Psychological Measurement, 16.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, No. 17.

Samejima, F. (1972). A general model for free-response data. Psychometrika Monograph, No. 18.

Samejima, F. (1973a). Homogeneous case of the continuous response model. Psychometrika, 38.

Samejima, F. (1973b). A comment on Birnbaum's three-parameter logistic model in the latent trait theory. Psychometrika, 38.

Samejima, F. (1979). A new family of models for the multiple-choice item. University of Tennessee, Knoxville, TN: Office of Naval Research Report.

Samejima, F. (1993a). An approximation for the bias function of the maximum likelihood estimate of a latent variable for the general case where the item responses are discrete. Psychometrika, 58.

Samejima, F. (1993b). The bias function of the maximum likelihood estimate of ability for the dichotomous response level. Psychometrika, 58.

Samejima, F. (1995). Acceleration model in the heterogeneous case of the general graded response model. Psychometrika, 60 (to be published in the December issue).

Samejima, F. (in preparation). Virtues of asymmetric item characteristic curves.

Thissen, D. & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51.

(Received October, 1995)
