
MISCLASSIFICATION PROBLEM AND ITS RELATION TO THE CONTINGENCY TABLE WITH SUPPLEMENTAL MARGINS

T. Timothy Chen, The Upjohn Company

1. Introduction.

In many studies, data may have errors. This could happen if we use fallible and inexpensive, rather than exact and expensive, devices to measure some variables. For example, in epidemiological studies, data are usually collected from an inexpensive interview instead of physicians' examinations or laboratory chemical tests. If the data are categorical, this problem is called the misclassification problem.

Suppose we are interested in one variable only, which has $r$ possible categories; because we use a fallible and inexpensive device, we observe a different variable with the same $r$ categories. Let us use a two-dimensional $r \times r$ contingency table to represent the situation: the first dimension is the fallible classification and the second dimension is the correct or true classification. Let the probability of any observation having $(i,j)$ as its fallible and correct classification be $\pi_{ij}$, with $\sum_{i,j} \pi_{ij} = 1$. The elements $\{a_{i,j}\}$ of the misclassification matrix $A$ are defined as $a_{i,j} = \pi_{ij}/\pi_{+j}$, which is the conditional probability of any observation having $i$ as the fallible classification given that it has $j$ as the true classification. If the $a_{i,j}$'s do not depend on $j$, then we have random misclassification and $\pi_{ij} = \pi_{i+}\pi_{+j}$.

Now, instead of just one variable, we are interested in the interrelationship between two variables, where the first variable X has $r$ possible categories and is subject to misclassification, and the second variable Y has $t$ possible categories and can be easily determined without error. Let us use a three-dimensional $r \times r \times t$ contingency table to describe the situation; the first and the second dimensions represent the fallible and the correct classifications of the variable X, and the third dimension represents the variable Y.
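As a small illustration of the one-variable definitions above, the following Python sketch computes a misclassification matrix from a joint table of fallible and true classifications. The probabilities are hypothetical and are not taken from the paper; the variable names are ours.

```python
import numpy as np

# Hypothetical joint probabilities pi[i, j]:
# row i = fallible classification, column j = true classification.
pi = np.array([[0.45, 0.05],
               [0.05, 0.45]])

# a[i, j] = pi[i, j] / pi[+, j] = P(fallible = i | true = j).
# Each column of A is a conditional distribution and sums to 1.
A = pi / pi.sum(axis=0, keepdims=True)

# Random misclassification: a[i, j] does not depend on j,
# which is equivalent to pi[i, j] = pi[i, +] * pi[+, j].
pi_random = np.outer(pi.sum(axis=1), pi.sum(axis=0))
A_random = pi_random / pi_random.sum(axis=0, keepdims=True)
```

In this example the columns of `A` differ (the device is informative), whereas all columns of `A_random` coincide.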
The misclassification matrix $A$ is an $r$ by $rt$ matrix with elements $\{a_{i,jk}\}$, where $a_{i,jk} = \pi_{ijk}/\pi_{+jk}$, which is the conditional probability of any observation having $i$ as the fallible X value, given that $j$ and $k$ are the true X and Y values. If the $a_{i,jk}$'s do not depend on $k$, then the misclassification is the same for any Y value and we have

$$\pi_{ijk} = \pi_{ij+}\,\pi_{+jk}/\pi_{+j+}, \tag{1.1}$$

which is the model of conditional independence of the first and the third dimensions in each layer of the second dimension. Let us denote this model by H(12,23), where the 12-marginal and the 23-marginal counts are the complete minimal sufficient statistics under the Poisson or multinomial sampling schemes (see Bishop, Fienberg, and Holland (1975) and Haberman (1974)). From equation (1.1), we can see that independence on the 23-margin implies independence on the 13-margin, but not vice versa unless the matrix $A$ has $r$ as its rank.

Diamond and Lilienfeld (1962), Newell (1962), and Rogot (1961) considered the above model in the case $r = t = 2$ and they showed that

$$\left|\frac{\pi_{+11}}{\pi_{++1}} - \frac{\pi_{+12}}{\pi_{++2}}\right| \ge \left|\frac{\pi_{1+1}}{\pi_{++1}} - \frac{\pi_{1+2}}{\pi_{++2}}\right| \tag{1.2}$$

and

$$\frac{\pi_{+11}\,\pi_{+22}}{\pi_{+12}\,\pi_{+21}} \ge \frac{\pi_{1+1}\,\pi_{2+2}}{\pi_{1+2}\,\pi_{2+1}}, \tag{1.3}$$

where in (1.3) each ratio is replaced by its inverse if it is less than 1. In epidemiology, if Y represents two different populations, and X represents having a disease or not, then the above two equations say that the true risk difference is greater than the fallible or stated risk difference, and the true approximate relative risk (the true odds ratio or its inverse, whichever is greater than 1) is greater than the fallible or stated approximate relative risk. But these will not be true with probability one when we replace the population proportions by the observed proportions.

Equations (1.2) and (1.3) can be explained intuitively by the log-linear representation of the model

$$\log \pi_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{23(jk)}, \tag{1.4}$$

where we see no $u_{13}$-terms; hence, the risk difference and the approximate relative risk on the 13-margin are smaller than those of the 23-margin.
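The attenuation described by (1.2) and (1.3) can be checked numerically. The sketch below builds a $2 \times 2 \times 2$ table under H(12,23) from a hypothetical misclassification matrix and a hypothetical true table (neither is from the paper), and compares the risk difference and odds ratio on the 13-margin with those on the 23-margin.

```python
import numpy as np

# Hypothetical inputs (not from the paper).
# A[i, j] = P(fallible X = i | true X = j), the same for every Y value.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
# true_xy[j, k] = P(true X = j, Y = k); index 0 of X means "diseased".
true_xy = np.array([[0.15, 0.05],
                    [0.35, 0.45]])

# Under H(12,23): pi[i, j, k] = a[i, j] * pi[+, j, k].
pi = A[:, :, None] * true_xy[None, :, :]
fallible_xy = pi.sum(axis=1)   # 13-margin: fallible X by Y

def risk_difference(m):
    """Difference between populations in P(X = diseased | Y = k)."""
    risk = m[0] / m.sum(axis=0)
    return risk[0] - risk[1]

def odds_ratio(m):
    """Cross-product ratio of a 2 x 2 table."""
    return (m[0, 0] * m[1, 1]) / (m[0, 1] * m[1, 0])

rd_true, rd_fallible = risk_difference(true_xy), risk_difference(fallible_xy)
or_true, or_fallible = odds_ratio(true_xy), odds_ratio(fallible_xy)
```

With these inputs the fallible risk difference equals the true one multiplied by $a_{1,1} - a_{1,2} = 0.7$, and the fallible odds ratio lies between 1 and the true odds ratio.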
Since it is very expensive to observe the true X values, we usually collect only the fallible X and the true Y values; i.e., we observe only the 13-margin of a three-dimensional contingency table. Bross (1954), Rubin, Rosenbaum, and Cobb (1956), and Mote and Anderson (1962) discussed inference about the relationship between true X and true Y (23-margin) in this situation. They concluded that the usual chi-square test of independence or homogeneity on the observed 13-margin is a correct $\alpha$-level test, with less power, for independence or homogeneity on the unobserved 23-margin, provided that the model H(12,23) is true and the misclassification matrix $A$ has $r$ as its rank.

Now let us discuss the situation where the variable Y is also subject to misclassification.


Let the fallible and the true X be the first and the third dimensions, and the fallible and the true Y be the second and the fourth dimensions, of a four-dimensional contingency table. The misclassification matrix $A$ is an $rt$ by $rt$ matrix with each element $a_{ij,kl} = \pi_{ijkl}/\pi_{++kl}$. If the element $a_{ij,kl}$ of the matrix $A$ is a product of two probabilities, one being the probability of any observation being fallibly classified into $i$ on the X variable given that its true X is $k$, and the other being the probability of any observation being fallibly classified into $j$ on the Y variable given that its true Y is $l$, then we have a model of independent misclassification and

$$\pi_{ijkl} = \frac{\pi_{i+k+}}{\pi_{++k+}}\,\frac{\pi_{+j+l}}{\pi_{+++l}}\,\pi_{++kl}. \tag{1.5}$$

From the above equation, it is clear that independence on the 34-margin implies independence on the 12-margin, but not vice versa unless the matrix $A$ is non-singular. Also, if we only look at the 134-marginal table, then the misclassification matrix is independent of the variable Y. For the 234-marginal table, the misclassification matrix is independent of the variable X. We denote this model by H(13,24,34) and we have a log-linear representation:

$$\log \pi_{ijkl} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{4(l)} + u_{13(ik)} + u_{24(jl)} + u_{34(kl)}, \tag{1.6}$$

where we do not have $u_{12}$-terms.

Keys and Kihlberg (1963) and Gullen, Bearman, and Johnson (1968) discussed the above model in the case $r = t = 2$ and they showed that

$$\left|\frac{\pi_{++11}}{\pi_{+++1}} - \frac{\pi_{++12}}{\pi_{+++2}}\right| \ge \left|\frac{\pi_{11++}}{\pi_{+1++}} - \frac{\pi_{12++}}{\pi_{+2++}}\right|. \tag{1.7}$$

It can also be shown that

$$\frac{\pi_{++11}\,\pi_{++22}}{\pi_{++12}\,\pi_{++21}} \ge \frac{\pi_{11++}\,\pi_{22++}}{\pi_{12++}\,\pi_{21++}}, \tag{1.8}$$

where each ratio is replaced by its inverse if it is less than 1. If Y represents two populations, and X represents having a disease or not, the above equations say that the risk difference and the approximate relative risk on the 12-margin (fallible X and Y) are smaller than those of the 34-margin (true X and Y), which can be explained intuitively by equation (1.6).
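A numerical sketch of the independent misclassification model (1.5) follows. The two misclassification matrices and the true table are hypothetical; the function name `fallible_margin` is ours. The sketch confirms that the constructed table satisfies (1.5), that the odds ratio on the 12-margin is attenuated relative to the 34-margin, and that independence on the 34-margin carries over to the 12-margin.

```python
import numpy as np

# Hypothetical inputs (not from the paper).
Ax = np.array([[0.90, 0.20], [0.10, 0.80]])  # P(fallible X = i | true X = k)
Ay = np.array([[0.85, 0.10], [0.15, 0.90]])  # P(fallible Y = j | true Y = l)
true_xy = np.array([[0.3, 0.1], [0.2, 0.4]])  # pi[+, +, k, l]

def full_table(Ax, Ay, true_xy):
    """pi[i, j, k, l] = Ax[i, k] * Ay[j, l] * pi[+, +, k, l]."""
    return np.einsum('ik,jl,kl->ijkl', Ax, Ay, true_xy)

def fallible_margin(Ax, Ay, true_xy):
    """12-margin (fallible X by fallible Y) of the four-way table."""
    return full_table(Ax, Ay, true_xy).sum(axis=(2, 3))

def odds_ratio(m):
    return (m[0, 0] * m[1, 1]) / (m[0, 1] * m[1, 0])

pi = full_table(Ax, Ay, true_xy)

# Right-hand side of (1.5), rebuilt from the margins of pi alone.
rhs = np.einsum('ik,jl,kl->ijkl',
                pi.sum(axis=(1, 3)) / pi.sum(axis=(0, 1, 3)),
                pi.sum(axis=(0, 2)) / pi.sum(axis=(0, 1, 2)),
                pi.sum(axis=(0, 1)))

or_true = odds_ratio(true_xy)
or_fallible = odds_ratio(fallible_margin(Ax, Ay, true_xy))

# Independence on the 34-margin implies independence on the 12-margin.
independent_true = np.outer([0.4, 0.6], [0.5, 0.5])
or_fallible_indep = odds_ratio(fallible_margin(Ax, Ay, independent_true))
```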
When we observe only the fallible classifications for both variables, under the assumptions of independent misclassification and non-singularity of the matrix $A$, Assakul and Proctor (1967) showed that the usual chi-square test of independence on the observed 12-margin gives a correct $\alpha$-level test of independence on the unobserved 34-margin, but misclassification reduces the power of this test compared with the direct test on the 34-margin. In the case of non-independent errors they showed that the test on the 12-margin would, in general, have a larger type I error for the independence hypothesis on the 34-margin.

The above discussion shows that log-linear models provide a class of models which give a meaningful interpretation of the misclassification matrix, and under some models the test on the observed fallible data provides a correct test for the unobserved true data. But unless we have past experience, or can examine some data which have both the fallible and the true values, we cannot be sure about the applicability of a particular model for the misclassification matrix. Therefore, besides observing the inexpensive fallible data, we should also collect both fallible and true data on some observations. This is the double sampling scheme proposed recently by Tenenbein (1970, 1971, 1972) and Chiacchierini and Arnold (1977); the data collected can be presented as a full contingency table of both fallible and true data with a supplemental lower-dimensional margin of fallible data.

2. Double Sampling Scheme.

In this section we will discuss how to analyze categorical data with misclassification and double sampling. The details of the analysis will be shown for a three-dimensional contingency table with the first and the second dimensions representing the fallible and the true X, and the third dimension representing the true Y variable.
Suppose we observe $n$ subjects with all three dimensions, and $N - n$ subjects for the first and the third dimensions only; the observed counts in the full table are denoted by $x_{ijk}$ and the observed counts in the supplemental 13-margin are denoted by $V_{ik}$ (where $\sum_{i,j,k} x_{ijk} = n$ and $\sum_{i,k} V_{ik} = N - n$). We assume all $x_{ijk}$ are greater than zero for simplicity. The main inference is about the independence of the true X and the true Y variables, but specifying a correct structure of misclassification may give us better power for the test. The structures of misclassification we want to investigate are those log-linear models having $u_{23}$-terms, namely H(123), H(12,13,23), H(12,23), H(13,23), and H(1,23). The first model, H(123), can be expressed as

$$\log \pi_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)} + u_{123(ijk)}, \tag{2.1}$$

with each set of subscripted $u$-terms adding to zero when summed over any subscript. This is the unrestricted (saturated) model, where we have no restriction on the cell probabilities $\{\pi_{ijk}\}$. The second model is a model of no second-order interaction, with $u_{123(ijk)} = 0$ in (2.1). The third and the fourth models are models of conditional independence as explained in Section 1. The fifth model is a model of independence between the first dimension and the other two dimensions, which is equivalent to (2.1) with $u_{12(ij)}$, $u_{13(ik)}$, and $u_{123(ijk)}$ all set to zero.

Since we have double sampling data, the expected counts for $x_{ijk}$ and $V_{ik}$ are $n\pi_{ijk}$ and $(N-n)\pi_{i+k}$ respectively. Under the unrestricted model H(123), we have the following ML equations:

$$N\hat\pi_{ijk} = x_{ijk} + V_{ik}\,\hat\pi_{ijk}/\hat\pi_{i+k}, \quad \forall i,j,k, \tag{2.2}$$

where the right-hand side is the observed count in the cell $(i,j,k)$ plus a proportional allocation of the supplemental marginal count to that cell based on the MLE's $\{\hat\pi_{ijk}\}$. For the no second-order interaction model, the ML equations are given by:

$$N\hat\pi_{ij+} = x_{ij+} + \sum_k V_{ik}\,\hat\pi_{ijk}/\hat\pi_{i+k}, \quad \forall i,j, \tag{2.3}$$

$$N\hat\pi_{i+k} = x_{i+k} + V_{ik}, \quad \forall i,k, \tag{2.4}$$

and

$$N\hat\pi_{+jk} = x_{+jk} + \sum_i V_{ik}\,\hat\pi_{ijk}/\hat\pi_{i+k}, \quad \forall j,k. \tag{2.5}$$

Next, for the model H(12,23) the ML equations are given by equations (2.3) and (2.5). For the model H(1,23) the ML equations are given by equation (2.5) and $N\hat\pi_{i++} = x_{i++} + V_{i+}$. In general, the ML equations correspond to the highest-order subscripted $u$-terms in the model. We can use an iterative procedure such as the one described below to get a numerical solution to the ML equations.

The iterative procedure we propose is an extension of the iterative proportional fitting used by Bishop et al. (1975), Goodman (1970), and Haberman (1974). For all models, we take the same initial value:

$$\hat\pi^{(0)}_{ijk} = 1/(r^2 t) \quad \text{for all } i,j,k. \tag{2.6}$$

For a given log-linear model each cycle consists of a set of pairs of steps, each pair corresponding to one of the sets of ML equations for the model.
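The pairs of steps just described can be sketched in Python for the no second-order interaction model, whose ML equations are (2.3), (2.4), and (2.5). This is a minimal sketch under our own assumptions: the function name `fit_no_second_order_interaction`, the fixed number of cycles, and the example counts are hypothetical and are not part of the paper or of the author's program.

```python
import numpy as np

def fit_no_second_order_interaction(x, V, n_cycles=500):
    """Fit H(12,13,23) to full-table counts x[i, j, k] (shape r x r x t)
    and supplemental 13-margin counts V[i, k] (shape r x t).
    Returns estimated cell probabilities p[i, j, k]."""
    x = np.asarray(x, dtype=float)
    V = np.asarray(V, dtype=float)
    N = x.sum() + V.sum()
    p = np.full(x.shape, 1.0 / x.size)        # initial value 1/(r^2 t), as in (2.6)
    for _ in range(n_cycles):
        # Pair 1 (12-margin): allocate V[i, k] over j in proportion to p, then rescale.
        alloc = V[:, None, :] * p / p.sum(axis=1, keepdims=True)
        target = (x + alloc).sum(axis=2) / N
        p = p * (target / p.sum(axis=2))[:, :, None]
        # Pair 2 (13-margin): this margin is observed directly in both samples.
        target = (x.sum(axis=1) + V) / N
        p = p * (target / p.sum(axis=1))[:, None, :]
        # Pair 3 (23-margin): reallocate V with the updated p, then rescale.
        alloc = V[:, None, :] * p / p.sum(axis=1, keepdims=True)
        target = (x + alloc).sum(axis=0) / N
        p = p * (target / p.sum(axis=0))[None, :, :]
    return p

# Hypothetical counts for r = t = 2 (not from the paper).
x = np.array([[[120, 60], [30, 20]],
              [[25, 35], [70, 140]]])
V = np.array([[40, 25],
              [30, 55]])
p_hat = fit_no_second_order_interaction(x, V)
```

Dropping the second pair of steps gives the corresponding sketch for H(12,23). A convergence check on successive iterates would normally replace the fixed cycle count used here.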
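For example, for the model H(12,13,23) each cycle of the iterative procedure consists of the following six steps:

$$\hat\pi^{(\nu+1)}_{ij+} = \Big(x_{ij+} + \sum_k V_{ik}\,\hat\pi^{(\nu)}_{ijk}/\hat\pi^{(\nu)}_{i+k}\Big)\Big/N, \quad \forall i,j, \tag{2.7}$$

$$\hat\pi^{(\nu+1)}_{ijk} = \hat\pi^{(\nu)}_{ijk}\,\hat\pi^{(\nu+1)}_{ij+}/\hat\pi^{(\nu)}_{ij+}, \quad \forall i,j,k, \tag{2.8}$$

$$\hat\pi^{(\nu+2)}_{i+k} = (x_{i+k} + V_{ik})/N, \quad \forall i,k, \tag{2.9}$$

$$\hat\pi^{(\nu+2)}_{ijk} = \hat\pi^{(\nu+1)}_{ijk}\,\hat\pi^{(\nu+2)}_{i+k}/\hat\pi^{(\nu+1)}_{i+k}, \quad \forall i,j,k, \tag{2.10}$$

$$\hat\pi^{(\nu+3)}_{+jk} = \Big(x_{+jk} + \sum_i V_{ik}\,\hat\pi^{(\nu+2)}_{ijk}/\hat\pi^{(\nu+2)}_{i+k}\Big)\Big/N, \quad \forall j,k, \tag{2.11}$$

$$\hat\pi^{(\nu+3)}_{ijk} = \hat\pi^{(\nu+2)}_{ijk}\,\hat\pi^{(\nu+3)}_{+jk}/\hat\pi^{(\nu+2)}_{+jk}, \quad \forall i,j,k. \tag{2.12}$$

For the model H(12,23) each cycle of the iterative procedure consists of the four steps given by equations (2.7), (2.8), (2.11), and (2.12) with $\hat\pi^{(\nu+2)}_{ijk} = \hat\pi^{(\nu+1)}_{ijk}$.

Once we have the MLE's for the cell probabilities, we can compute either the Pearson or the likelihood ratio statistic to test the goodness-of-fit of the model:

$$X^2 = \sum_{i,j,k} \frac{(x_{ijk} - n\hat\pi_{ijk})^2}{n\hat\pi_{ijk}} + \sum_{i,k} \frac{\big(V_{ik} - (N-n)\hat\pi_{i+k}\big)^2}{(N-n)\hat\pi_{i+k}} \tag{2.13}$$

and

$$G^2 = 2\sum_{i,j,k} x_{ijk} \log\frac{x_{ijk}}{n\hat\pi_{ijk}} + 2\sum_{i,k} V_{ik} \log\frac{V_{ik}}{(N-n)\hat\pi_{i+k}} \tag{2.14}$$

with appropriate degrees of freedom. For the model H(12,13,23) we have estimated the $u_1$, $u_2$, $u_3$, $u_{12}$, $u_{13}$, and $u_{23}$ terms; hence, we have $(r^2t-1) + (rt-1) - 2(r-1) - (t-1) - (r-1)^2 - 2(r-1)(t-1)$ d.f. for the tests.

We will first fit the model H(123) to the data $\{x_{ijk}\}$, $\{V_{ik}\}$ to find out whether they are consistent with each other, i.e., whether $\{x_{ijk}\}$ and $\{V_{ik}\}$ are both random samples from the same target population. After showing that this model fits the data, we can fit the next simpler model. We can examine both the unconditional test and the conditional test (which is the difference between two unconditional tests) statistics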

to decide whether to accept this model. We can proceed like this to choose the simplest appropriate model to describe the data. The general step-wise procedure of fitting models for a contingency table has been described in Goodman (1971).

After a final model for the full table which still has $u_{23}$-terms has been chosen, i.e., we have chosen a model for the misclassification matrix, we can test the independence (or homogeneity) of true X and true Y in the 23-margin (H*(2,3)). We will again obtain the MLE's $\{\hat\pi^*_{ijk}\}$ under a particular model for the full table plus the model H*(2,3). Under the models H(123) and H*(2,3), we have the following ML equations:

$$N\hat\pi^*_{ijk} = E_{H^*}\big(x_{ijk} + V_{ik}\,\hat\pi^*_{ijk}/\hat\pi^*_{i+k}\big), \tag{2.15}$$

where, writing $y_{ijk} = x_{ijk} + V_{ik}\,\hat\pi^*_{ijk}/\hat\pi^*_{i+k}$,

$$E_{H^*}(y_{ijk}) = y_{ijk}\,\frac{\big[\sum_{i,k} y_{ijk}\big]\big[\sum_{i,j} y_{ijk}\big]}{N\big[\sum_{i} y_{ijk}\big]},$$

which is the adjustment of $\hat\pi_{ijk}$ by the independence hypothesis on the 23-margin. Under the models H(12,13,23) and H*(2,3), the ML equations are given by (2.3), (2.4), and (2.5) with the left-hand sides replaced by $N E_{H^*}(\hat\pi_{ij+})$, etc. These three ML equations can be solved by the following iterative procedure with

$$\hat\pi^{(0)}_{ijk} = \hat\pi^{*(0)}_{ijk} = 1/(r^2 t), \quad \forall i,j,k, \tag{2.16}$$

then

$$\hat\pi^{(\nu+1)}_{ij+} = \Big(x_{ij+} + \sum_k V_{ik}\,\hat\pi^{*(\nu)}_{ijk}/\hat\pi^{*(\nu)}_{i+k}\Big)\Big/N, \quad \forall i,j, \tag{2.17}$$

$$\hat\pi^{(\nu+1)}_{ijk} = \hat\pi^{*(\nu)}_{ijk}\,\hat\pi^{(\nu+1)}_{ij+}/\hat\pi^{*(\nu)}_{ij+}, \quad \forall i,j,k, \tag{2.18}$$

$$\hat\pi^{*(\nu+1)}_{ijk} = \hat\pi^{(\nu+1)}_{ijk}\,\hat\pi^{(\nu+1)}_{+j+}\,\hat\pi^{(\nu+1)}_{++k}/\hat\pi^{(\nu+1)}_{+jk}, \quad \forall i,j,k, \tag{2.19}$$

and the other six steps are similar modifications of (2.9), (2.10), (2.11), and (2.12) into procedures like (2.17), (2.18), and (2.19). The rationale behind the whole procedure is that we first obtain $\hat\pi^{(\nu)}$ in the parameter space specified by the model for the full table, then we adjust $\hat\pi^{(\nu)}$ to $\hat\pi^{*(\nu)}$, which is in the intersection of the above space and the space specified by H*(2,3). Convergence can be achieved if there is no empty cell in the full table, since the likelihood function is concave and bounded above.
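The adjustment toward H*(2,3) in (2.19) replaces the 23-margin of the current estimate by the product of its one-way margins while leaving the conditional distribution of the first dimension untouched. A minimal Python sketch of that single step follows; the function name `impose_23_independence` and the input table are hypothetical.

```python
import numpy as np

def impose_23_independence(p):
    """Rescale p[i, j, k] so that its 23-margin factorizes as
    p[+, j, +] * p[+, +, k], in the manner of step (2.19)."""
    m23 = p.sum(axis=0)        # p[+, j, k]
    m2 = m23.sum(axis=1)       # p[+, j, +]
    m3 = m23.sum(axis=0)       # p[+, +, k]
    return p * (np.outer(m2, m3) / m23)[None, :, :]

# Hypothetical 2 x 2 x 3 table of positive probabilities summing to 1.
p = np.arange(1, 13, dtype=float).reshape(2, 2, 3) / 78.0
q = impose_23_independence(p)
```

Here `q` still sums to 1, its 23-margin is exactly the outer product of the second and third one-way margins of `p`, and `q[i, j, k] / q[+, j, k]` equals `p[i, j, k] / p[+, j, k]`, so the implied misclassification probabilities are unchanged by the adjustment.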
Once the MLE's $\{\hat\pi^*_{ijk}\}$ are obtained, we can test the goodness-of-fit of the model by computing either the Pearson or the likelihood ratio statistic as in (2.13) and (2.14). For the model H(12,13,23) ∩ H*(2,3), since we have 23-marginal independence constraints on the $u$-terms, we reduce the number of free $u$-terms by $(r-1)(t-1)$, so we have $(r^2t-1) + (rt-1) - 2(r-1) - (t-1) - (r-1)^2 - (r-1)(t-1)$ d.f. for the tests. We will decide whether H*(2,3) is true or not conditional on a particular model for the misclassification matrix. The value of this conditional test statistic does depend on the model we have specified for the full table.

It should be noted here that the model H(12,23) ∩ H*(2,3) is equivalent to H(12,3); similarly H(13,23) ∩ H*(2,3) is H(13,2), and H(1,23) ∩ H*(2,3) is H(1,2,3), which is mutual independence of the three dimensions. The model H(12,13) does not have $u_{23}$-terms, hence the ML equations for H(12,13) ∩ H*(2,3) are not of the type specified in (2.15) and (2.16).

The method described above can be extended easily to a higher-dimensional table with many variables subject to misclassification. We first build log-linear models for the full table (including both fallible and true classifications), which have $u$-terms corresponding to the lower-dimensional margin of true classifications. The method for this was explained in detail in Chen (1972). After a model is finally chosen for the full table, we can then build log-linear models for the lower-dimensional margin of true classifications using a procedure similar to the one explained in this paper. The iterative procedures proposed herein are examples of the generalized EM algorithm given in Dempster, Laird, and Rubin (1977).

A computer program, which is an extension of Haberman (1972), has been written according to the method in this paper to give MLE's of cell


probabilities and counts under different models and to produce both goodness-of-fit statistics with appropriate degrees of freedom. It is available to any interested person upon request.

Tenenbein (1970, 1971, 1972) first proposed using a double sampling scheme to make inferences about categorical data with misclassification. He discussed only the estimation problem in the one-variable case, without any assumption on the misclassification matrix. The estimates he obtained are similar to those obtained in Chen and Fienberg (1974). He derived formulas to determine the optimum double sampling ratio ($n/N$) so that the variances of the estimates are smallest; his formulas may be used in our model building problem. Chiacchierini and Arnold (1977) discussed a test of independence for the two-variable case with $r = t = 2$, which is our conditional test of H*(3,4) given that H(1234) is true.

3. An Example.

Cobb and Rosenbaum (1956) reported an arthritis study in the Arsenal Health District of Pittsburgh. A household morbidity survey was conducted in July, 1952, using a random sample of 3,000 households. All the persons over 14 years old in these households were classified into three strata, based on the information regarding rheumatism and arthritis obtained by non-medical interviewers: Stratum 1 consisted of individuals who were recorded as having arthritis or rheumatism; Stratum 2 consisted of individuals who were recorded as free of arthritis or rheumatism, but were reported to have some rheumatic symptoms; Stratum 3 was made up of the remainder, who were not recorded as suffering from rheumatism, arthritis, or related manifestations. A random sample of persons was selected for each sex separately and within each stratum. The sampling rate was 60% for males and 30% for females in Strata 1 and 2, and 7% for both males and females in Stratum 3; this resulted in a total sample of 798 persons.
Each person thus sampled was visited in his home by a non-medical interviewer equipped with the detailed arthritis questionnaire, and the individuals who were interviewed were urged to have an examination by physicians in the arthritis clinic. Some persons refused the interview, or were unavailable for interview, and some did not return to the clinic for examination. The final data included 478 people with both the interview and the examination. The data about whether the person had joint pain are given in Table 1. The two "unknown" rows were not reported in Cobb and Rosenbaum (1956); instead, they are generated artificially as supplemental data to demonstrate the methodology. Let the first dimension be the interview result, the second dimension be the physician's history, and the third and fourth dimensions be the strata and the sex.

Table 1. Number of Persons Having Joint Pain by Sex and Stratum as Obtained by Physicians vs. by Non-Medical Interviewers. Columns: interview result (Yes, No) by stratum. Rows: physician's examination (Yes, No, Unknown), for (a) males and (b) females. [Table entries not recoverable from the source.]

We first fit the model H(1234) just to see whether the supplemental data are consistent with the data in the full 2 x 2 x 3 x 2 table. This model fits the data very well with $X^2 = 4.19$ and $G^2 = 4.20$, 11 d.f. We then try to fit the models which will give us nice interpretations for the misclassification matrix. Among the models H(123,234), H(124,234), and H(134,234), the model H(123,234) fits the data well with $X^2$ = [...] and $G^2 = 11.87$, 17 d.f. When we try to fit simpler models which have the misclassification probabilities as explicit formulas of the marginal probabilities, H(12,234) and H(13,234), both fail to fit the data. We then try to fit the model H(12,13,234), and this model fits well with $X^2$ = [...] and $G^2$ = [...], 19 d.f.; therefore, we will use it to interpret the misclassification matrix.
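The degrees of freedom quoted above (11, 17, and 19) can be reproduced by counting cells and free $u$-terms. The sketch below assumes the counting rule used in Section 2, namely (cells in the full table minus 1) plus (cells in the supplemental margin minus 1) minus the number of free $u$-terms other than the constant. The function names are ours and are hypothetical.

```python
from itertools import combinations
from math import prod

def n_free_u_terms(generators, dims):
    """Number of free u-terms (excluding the constant u) in a hierarchical
    log-linear model. `generators` are strings of 1-based dimension labels,
    e.g. ['12', '13', '234']; `dims` gives the number of categories."""
    terms = set()
    for g in generators:
        idx = sorted(int(c) for c in g)
        for size in range(1, len(idx) + 1):
            terms.update(combinations(idx, size))
    # A u-term over dimensions d1, d2, ... has prod(dims[d] - 1) free values.
    return sum(prod(dims[d - 1] - 1 for d in term) for term in terms)

def double_sampling_df(generators, dims, supp_margin):
    """Degrees of freedom for a full table plus one supplemental margin."""
    full_cells = prod(dims)
    supp_cells = prod(dims[d - 1] for d in supp_margin)
    return (full_cells - 1) + (supp_cells - 1) - n_free_u_terms(generators, dims)

# The 2 x 2 x 3 x 2 example: interview, physician, stratum, sex,
# with a supplemental 134-margin.
dims = (2, 2, 3, 2)
df_saturated = double_sampling_df(['1234'], dims, (1, 3, 4))
df_123_234 = double_sampling_df(['123', '234'], dims, (1, 3, 4))
df_12_13_234 = double_sampling_df(['12', '13', '234'], dims, (1, 3, 4))
```

For the three-dimensional case, `double_sampling_df(['12', '13', '23'], (r, r, t), (1, 3))` gives the same count as the d.f. formula stated for H(12,13,23) in Section 2.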
Under this model we have

$$\pi_{ijkl} = \pi_{ijk+}\,\pi_{+jkl}/\pi_{+jk+}, \quad \forall i,j,k,l, \tag{3.1}$$

or

$$a_{i,jkl} = \pi_{ijkl}/\pi_{+jkl} = \pi_{ijk+}/\pi_{+jk+}. \tag{3.2}$$

Hence, the misclassification matrix is uniform over sex and depends only on strata.

Now we try to investigate the relationship among true joint pain, strata, and sex on their margin, given that the model H(12,13,234) is true; it turns out that the simplest model which still has a good fit is H(12,13,234) ∩ H*(23,4), with $X^2$ = [...], $G^2 = 16.14$, 24 d.f. But, since we have the fixed sex by strata margin (34-margin) originally, we have to settle on the model H(12,13,234) ∩ H*(23,34) as the final model: the joint pain and the sex are conditionally independent given the

strata. The conclusion is that the prevalence rate of joint pain is not a function of sex, but only a function of strata. The estimates of the proportions of classification errors, and the estimates of the prevalence rates for joint pain under the final model H(12,13,234) ∩ H*(23,34), are given in Table 2 by stratum.

Table 2. The Estimates of Proportions of Classification Errors and the Estimates of Prevalence for Joint Pain by Stratum Under the Model H(12,13,234) ∩ H*(23,34). (a) Classification errors: false negatives and false positives, by stratum. (b) Prevalence estimates: physicians' and interviewers', by stratum. [Table entries not recoverable from the source.]

REFERENCES

Assakul, K., and Proctor, C.H. (1967), "Testing Independence in Two-Way Contingency Tables with Data Subject to Misclassification," Psychometrika, 32.

Bishop, Y.M.M., Fienberg, S.E., and Holland, P.W. (1975), Discrete Multivariate Analysis: Theory and Practice, Cambridge, Massachusetts: MIT Press.

Bross, I. (1954), "Misclassification in 2 x 2 Tables," Biometrics, 10.

Chen, T.T. (1972), "Mixed-up Frequencies and Missing Data in Contingency Tables," Unpublished Ph.D. Dissertation, Dept. of Statistics, Univ. of Chicago.

Chen, T.T., and Fienberg, S.E. (1974), "Two-Dimensional Contingency Tables with Both Completely and Partially Cross-Classified Data," Biometrics, 30.

Chiacchierini, R.P., and Arnold, J.C. (1977), "A Two-Sample Test for Independence in 2 x 2 Contingency Tables with Both Margins Subject to Misclassification," J. Amer. Statist. Assoc., 72.

Cobb, S., and Rosenbaum, J. (1956), "A Comparison of Specific Symptom Data Obtained by Nonmedical Interviewers and by Physicians," J. Chronic Diseases, 4.

Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. R. Statist. Soc., Ser. B, 39.

Diamond, E.L., and Lilienfeld, A.M. (1962), "Effects of Errors in Classification and Diagnosis in Various Types of Epidemiological Studies," Amer. J. Public Health, 52.

Goodman, L.A. (1970), "The Multivariate Analysis of Qualitative Data: Interactions Among Multiple Classifications," J. Amer. Statist. Assoc., 65.

Goodman, L.A. (1971), "The Analysis of Multidimensional Contingency Tables: Stepwise Procedures and Direct Estimation Methods for Building Models for Multiple Classification," Technometrics, 13.

Haberman, S.J. (1972), "Log-Linear Fit for Contingency Tables," Applied Statist., 21.

Haberman, S.J. (1974), The Analysis of Frequency Data, Chicago: Univ. of Chicago Press.

Keys, A., and Kihlberg, J.K. (1963), "Effects of Misclassification on Estimated Relative Prevalence of a Characteristic," Amer. J. Public Health, 53.

Mote, V.L., and Anderson, R.L. (1962), "An Investigation of the Effect of Misclassification on the Properties of X2-Test in the Analysis of Categorical Data," Biometrika, 52.

Newell, D.J. (1962), "Errors in the Interpretation of Errors in Epidemiology," Amer. J. Public Health, 52.

Rogot, E. (1961), "A Note on Measurement Errors and Detecting Real Differences," J. Amer. Statist. Assoc., 56.

Rubin, T., Rosenbaum, J., and Cobb, S. (1956), "The Use of Interview Data for the Detection of Associations in Field Studies," J. Chronic Diseases, 4.

Tenenbein, A. (1970), "A Double Sampling Scheme for Estimating from Binomial Data with Misclassification," J. Amer. Statist. Assoc., 65.

Tenenbein, A. (1971), "A Double Sampling Scheme for Estimating from Binomial Data with Misclassification: Sample Size Determination," Biometrics, 27.

Tenenbein, A. (1972), "A Double Sampling Scheme for Estimating from Misclassified Multinomial Data with Application to Sampling Inspection," Technometrics, 14.
