

Correlation Coefficients for Binary Data in Factor Analysis

Jerome Kaltenhauser and Yuk Lee

Jerome Kaltenhauser is systems analyst, Computing Center, and Yuk Lee is associate professor of geography, University of Colorado.

The most commonly used factor analytic models employ a symmetric matrix of correlation coefficients as input. The Pearson product moment coefficient is appropriate when the variables are continuous. For binary (two-valued) data there is no single agreed-upon coefficient [7]. Three coefficients (phi, phi/phimax, and tetrachoric) are frequently discussed in the literature and are the focus of the present investigation. It is assumed that the coefficient is to be used in factor analysis and specifically in exploratory factor analyses [3].

The characteristics of these coefficients have been discussed by others [2, 4, 8] and will be quickly summarized here. It is convenient to discuss the coefficients in terms of a two-way table relating the occurrences of the values of the binary variables x and y to be correlated.

    x\y      0        1
    0        c        d        c + d
    1        b        a        b + a
             c + b    d + a    1.0

a, b, c, and d are the joint frequencies of combinations of values of x and y, while c + d, a + b, b + c, and a + d are the marginal frequencies or proportions for y and x.

Despite the special name and numerous expressions available for it, phi is simply the Pearson product moment formula applied to binary data. That is,

$$\phi = r = x^{t} y, \tag{1}$$

where x is the vector of standardized values of variable x (and $x^{t}$ is its transpose) and similarly for y; r is the product moment correlation coefficient. A defect of phi is that it can achieve the range -1.0 to 1.0 only under rare circumstances, when

$$\frac{a+b}{c+d} = \frac{a+d}{b+c}. \tag{2}$$
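The statement that phi is the product moment formula applied to binary data can be checked with a small Python sketch. It computes phi from the four cells of the two-way table and compares it with Pearson's r on the raw 0/1 values. The function names, the example counts, and the assignment of cells to value pairs (a for x = 1, y = 1; b for x = 1, y = 0; c for x = 0, y = 0; d for x = 0, y = 1) are assumptions made for this illustration and are not taken from the study.

```python
import math


def phi_from_table(a, b, c, d):
    """Phi from the four cells of a two-way table (counts or proportions).

    Assumed cell layout (hypothetical, for illustration only):
    a: x=1, y=1   b: x=1, y=0   c: x=0, y=0   d: x=0, y=1
    """
    # Marginal totals for each value of x and of y.
    x1, x0 = a + b, c + d
    y1, y0 = a + d, b + c
    return (a * c - b * d) / math.sqrt(x1 * x0 * y1 * y0)


def pearson_r(xs, ys):
    """Product moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up counts for 100 observations.
a, b, c, d = 30, 10, 40, 20
pairs = [(1, 1)] * a + [(1, 0)] * b + [(0, 0)] * c + [(0, 1)] * d
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]

phi = phi_from_table(a, b, c, d)
r = pearson_r(xs, ys)
print(phi, r)  # the two values agree up to rounding error
```

With b = d = 0 the two variables have identical marginals, the condition for full range is met, and `phi_from_table(0.3, 0.0, 0.7, 0.0)` evaluates to 1.0 up to rounding.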

306 / Geographical Analysis

Thus, phi will usually suffer some restriction in range. The simplest way to bring phi up to full range is to normalize it, and this produces phi/phimax. Phi/phimax is obtained by dividing phi by the maximum value it could assume consistent with the set of marginals from its two-way table:

$$\text{phimax} = \sqrt{\frac{p_2}{q_2} \cdot \frac{q_1}{p_1}}, \tag{3}$$

where $p_1$ is the largest marginal in the table, $p_2$ is the second largest, and p + q = 1.0 [2].

The tetrachoric coefficient was derived on the assumption that the observed frequencies in the two-way table express an underlying bivariate normal distribution. If a bivariate normal distribution is divided into quadrants by partitions parallel to the x- and y-axes, then the tetrachoric is a maximum likelihood estimate, using only the quadrant frequencies, of the product moment coefficient which would be calculated if the full bivariate distribution were available [4, 6]. Because calculation of the tetrachoric r involves evaluation of an infinite series, approximations must be resorted to.

Of the three coefficients, only phi suffers from restriction of range. This is important when correlations are interpreted directly as a measure of relationship between two variables. It is not necessarily of importance in factor analysis, where correlations are not directly interpreted. Phi has been used successfully in factor analysis, and an example of its performance relative to phi/phimax and the tetrachoric will be presented. Later we will see that phi has characteristics that would make it seem even less likely as a candidate for any kind of analysis.

Coefficient Performance in Factor Analysis

A preliminary evaluation of the three coefficients was made in a principal components analysis involving social, economic, and demographic data for Colorado counties. A matrix of 62 county observations on 18 variables was prepared and submitted to a principal components analysis using the SPSS package [5].
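The normalization of phi by phimax described earlier in this section can be sketched in Python as follows. The sketch assumes the usual square-root form of the bound, phimax = sqrt((p2/q2)(q1/p1)), with p1 the largest marginal and p2 the second largest. The function names and the example proportions are hypothetical. The example also assumes that the larger marginals of x and y fall on the same value (here the value 1), which is the case in which this bound applies to a positive phi.

```python
import math


def phi_from_marginals(p11, px, py):
    """Phi from P(x=1, y=1) = p11 and the marginals px = P(x=1), py = P(y=1)."""
    return (p11 - px * py) / math.sqrt(px * (1 - px) * py * (1 - py))


def phi_max(px, py):
    """Largest phi consistent with the marginals (assumed square-root form)."""
    larger_x = max(px, 1 - px)
    larger_y = max(py, 1 - py)
    p1 = max(larger_x, larger_y)   # largest marginal in the table
    p2 = min(larger_x, larger_y)   # second largest marginal
    q1, q2 = 1 - p1, 1 - p2
    # Both variables must take both values, otherwise q2 is zero.
    return math.sqrt((p2 / q2) * (q1 / p1))


# Made-up example: 80/20 split on x, 60/40 split on y.
px, py, p11 = 0.8, 0.6, 0.55
phi = phi_from_marginals(p11, px, py)
ratio = phi / phi_max(px, py)
print(phi, phi_max(px, py), ratio)
```

For these marginals the joint proportion p11 cannot exceed 0.6, and `phi_from_marginals(0.6, 0.8, 0.6)` reproduces `phi_max(0.8, 0.6)`, so the bound is attained exactly when the table is as concentrated as the marginals allow.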
The variables, which are continuous, were transformed to near normality to approximate the requirements of the tetrachoric coefficient (underlying bivariate normal distribution). For all but three of the variables the transformations were successful (50 percent confidence level using a chi-square test), and the exceptions do not appear to have seriously disturbed the results. With the variables still in continuous form, the above procedure produced a principal components analysis of a system of variables; the rotated factor loadings were used subsequently as a criterion against which to measure the loadings produced when the variables were converted to binary form. To convert variables to binary form, a cutoff point b must be chosen for each, with those values below b set to 0 and those above set to 1. Any two such binarized variables may then be related through a

Research Notes and Comments / 307

two-way table and phi, phi/phimax, and tetrachoric correlation coefficients computed. The cutoff points may differ from variable to variable and from $-\infty$ to $\infty$. When they leave the range $\mu \pm \sigma$, however, the coefficients will give poor results for moderate sample sizes. In the present example b was restricted to $\pm 0.84\sigma$ which, for normal distributions, will allow up to an 80 percent-20 percent split of 0's and 1's.

A computer program was written to calculate phi, phi/phimax, and tetrachoric given the desired cutoff points. The tetrachoric coefficient was calculated using the method described by Kirk [4]. A matrix of each type of coefficient was output for each of a large number of different b values, and the matrices were then submitted to the principal components analysis. A set of rotated factor loadings was obtained in each case, to allow comparison with the corresponding loadings from the continuous variables case.

The comparison of the loadings was carried out by computing the root-mean-square (RMS) difference between the continuous and binary situations:

[Equation (4) is missing from the source text.]

where M is the number of loadings compared on a given factor, F is the number of factors, $C_i$ is a continuous loading, and $B_i$ is the corresponding loading from the binary case. If $B_i = C_i$ for all loadings, RMS would be zero, indicating a perfect match. A large RMS indicates a poor match, but the scale is arbitrary. In the present example, the RMS values for phi/phimax and tetrachoric will be compared with those from phi. Figure 1 shows the results for a large number of runs.

[FIG. 1. Performance of Coefficients of Binarized Data. Plotted: Tetrachoric vs. Phi and Phi/Phimax vs. Phi; axis label: Tetrachoric and Phi/Phimax RMS Deviation.]

The runs

are distinguished from each other by the choice of cutoff points used to binarize the data. There was only one continuous case run since the choice of points for binarization does not affect the continuous case. To exclude insignificant variability, only loadings that exceeded [value missing in source] were used in the comparisons. It will be seen in Figure 1 that the RMS error for phi is generally smaller than for either phi/phimax or the tetrachoric, despite the fact that cutoff points far from the mean, which should restrict the range of phi considerably, were chosen in many cases. Thus phi appears in this example to give good results in factor analysis.

Of course one empirical example establishes nothing. Results may vary with sample size (here N = 62) and with variable distributions. Normal marginal distributions, as effected here by transformations, do not guarantee the bivariate normality required by the tetrachoric [1]. To explore the effects of some of these parameters a simulation study was undertaken.

Simulation Study

A simulation in the present context offers the advantage that the population parameters can be controlled and varied at will, but suffers from the defect that a random element is introduced. Each simulation presents us with only a particular outcome and so can never establish a general case. This can be mitigated by examining a large number of outcomes for regularities, however.

In this study, sets of N values of variables x and y were generated, with $r_{xy}$ being the population correlation coefficient and N the sample size. The set of N x,y pairs may then be regarded as a random sample of N points from an infinite population with a correlation coefficient $r_{xy}$. The method of generating x and y simulated drawings from a bivariate normal distribution.
In particular,

$$x = n(z_i;\ 0.0,\ 1.0) \tag{5}$$

$$y = x\, r_{xy} + (1.0 - r_{xy}^{2})^{1/2}\, n(z_j;\ 0.0,\ 1.0), \tag{6}$$

where $z_i$ and $z_j$ are random numbers and $n(z; \mu, \sigma)$ is a function that converts a number to a point on the normal cumulative distribution curve. x, which is generated first and then substituted into the equation for y, is a random normal variate with (population) zero mean and (population) unit variance. y is similarly distributed, and the values of x,y pairs are such as to simulate two variables correlated $r_{xy}$.

Once N such x,y pairs are available, the sample correlation coefficient can be calculated. x and y can be binarized and phi, phi/phimax, and tetrachoric coefficients calculated. This process can be done repeatedly to get a sample of correlation coefficients (in the previous instance x,y pairs only were sampled whereas here correlation coefficients are sampled). N can be varied to examine the effect on the binary correlation

coefficients. In addition, to see the effect of nonnormal distributions, the x,y pairs generated according to equations (5) and (6) were perturbed in two ways: first, varying amounts of uniform random noise were added in, and second, x and/or y were transformed according to

$$x' = (x + c)^{h}, \tag{7}$$

where h was allowed to vary up to 2.75 and c is a constant selected such that x + c > 0. In each instance, x and y were restandardized prior to calculating the correlation coefficients.

Figure 2 presents the averages of sample correlation coefficients generated as above. Figure 2 contains nine panels, each containing plots of average binary coefficients as a function of sample size N. Columns of panels are differentiated by population $r_{xy}$. Rows of panels are differentiated by the binarization points employed. The runs in the top row were binarized at the mean (b(x) = 0.0, b(y) = 0.0), and in the lower panels b(y) departs progressively from the mean. b(y) = 0.84 implies that the binarization point for y was at $\mu + 0.84\sigma$, or, with $\mu = 0.00$ and $\sigma = 1.0$, at 0.84. All values of y greater than 0.84 were converted to 1 and all others to 0. All phi values in the lower right-hand panel are less than [value illegible in source].

[FIG. 2. Artificial Data: Calculated Coefficients with Varying Sample Parameters]
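The generation and binarization procedure described above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the program used in the study: the function n(z; 0, 1) is taken to be the inverse of the normal cumulative distribution applied to a uniform random number, all function names are hypothetical, and only the continuous r and phi are computed (phi/phimax and the tetrachoric are omitted).

```python
import math
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist(0.0, 1.0)


def normal_variate(rng):
    """Assumed form of n(z; 0, 1): inverse normal CDF of a uniform number z."""
    z = rng.random()
    while z == 0.0:            # inv_cdf requires 0 < z < 1
        z = rng.random()
    return STD_NORMAL.inv_cdf(z)


def correlated_pairs(n, r_xy, rng):
    """Draw n (x, y) pairs following equations (5) and (6)."""
    pairs = []
    for _ in range(n):
        x = normal_variate(rng)
        y = x * r_xy + math.sqrt(1.0 - r_xy ** 2) * normal_variate(rng)
        pairs.append((x, y))
    return pairs


def pearson_r(xs, ys):
    """Product moment correlation; phi when xs and ys are 0/1 values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def binarize(values, cutoff):
    """Values above the cutoff become 1, all others 0."""
    return [1 if v > cutoff else 0 for v in values]


def average_coefficients(n, reps, r_xy, bx, by, seed=0):
    """Average continuous r and phi over `reps` samples of n pairs each."""
    rng = random.Random(seed)
    rs, phis = [], []
    for _ in range(reps):
        xs, ys = zip(*correlated_pairs(n, r_xy, rng))
        rs.append(pearson_r(xs, ys))
        phis.append(pearson_r(binarize(xs, bx), binarize(ys, by)))
    return mean(rs), mean(phis)


# Eight samples of 100 pairs, population r = 0.50, y cut at 0.84.
print(average_coefficients(n=100, reps=8, r_xy=0.5, bx=0.0, by=0.84))
```

Very small samples combined with extreme cutoffs can produce a binarized variable with no variation, in which case `pearson_r` divides by zero; the sketch does not guard against that case.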

Each data point in Figure 2 represents the average of a number of sample binary correlation coefficients. N = 50 implies that fifty x,y pairs were used for each coefficient; sixteen such coefficients were calculated and averaged, for a total of eight hundred x,y pairs. Eight hundred were used for every data point, and thus N = 100 denotes eight coefficients of one hundred x,y pairs apiece, and so forth. Phi, phi/phimax, and tetrachoric show similar trends with N because a single binarized set of data for each N was used to calculate the three coefficients. The data sets for different values of N are independent.

The most striking regularity in Figure 2 is that phi systematically underestimates $r_{xy}$. The defect grows with increasing $r_{xy}$ and increasing departure of b(y) from the mean. It is apparent that phi is not a close estimate of the continuous product moment coefficient. Phi/phimax and tetrachoric, on the other hand, appear to supply reasonable estimates of this parameter, even for extreme binarization points. (Phi/phimax is absent from the top diagrams of Figs. 2-4 because phi and phi/phimax become the same in those situations.)

[FIG. 3. Artificial Data: Values of Fig. 2 Ratioed by the Highest Coefficients. Legend: Phi, Phi/Phimax, Tetrachoric; panels distinguished by b(x) and b(y).]

How then can phi perform so well in factor analysis? A clue can be found in equation (8) for factor loadings.

$$R_{m \times m} = F_{m \times p}\, F_{p \times m}^{t}, \tag{8}$$

where $R_{m \times m}$ is the m × m correlation matrix and F is the matrix of factor loadings. If a matrix $K_{m \times m}$ with the constant k everywhere on its diagonal is introduced, then

$$K_{m \times m} R_{m \times m} = K_{m \times m} F_{m \times p} F_{p \times m}^{t},$$

$$K_{m \times m} R_{m \times m} = K_{m \times m}^{1/2} F_{m \times p} F_{p \times m}^{t} K_{m \times m}^{1/2},$$

or

$$K_{m \times m} R_{m \times m} = (K_{m \times m}^{1/2} F_{m \times p})(F_{p \times m}^{t} K_{m \times m}^{1/2}). \tag{9}$$

In other words, if all values in the correlation matrix are multiplied by k, the corresponding loadings will be changed by $k^{1/2}$ but the relative values of the factors will be unchanged. Interpretation of the factors should not be altered, especially after rotation.

Figure 3 compares the coefficients with the effect of a hypothetical constant k removed. In each panel, the average coefficient for $r_{xy}$ = 0.20 and $r_{xy}$ = 0.50 is divided by the average for $r_{xy}$ = [value missing in source]. If k were not a constant (that is, if k varied with the magnitude of $r_{xy}$), a trend should be visible with varying $r_{xy}$. This does not appear to be so, and it can be concluded that k is very nearly a constant. There is a slight trend with b(y), comparing with the tetrachoric, but it would appear that in a real data matrix, where the binarization points would be mixed for different variables, it should be of little consequence.

[FIG. 4. Artificial Data: Sampling Standard Deviation of Calculated Coefficients]

This appears

to be the explanation of the good performance of phi (relative to phi/phimax and the tetrachoric) in factor analysis, despite its obvious deficiencies.

The evidence from Figures 2 and 3 would seem to place phi, phi/phimax, and the tetrachoric on an equal basis for use in factor analysis, whereas there is some evidence, presented above, that phi performs somewhat better. Figure 4 shows why this is so. In Figure 4 the sample standard deviation of each coefficient is presented as a function of $r_{xy}$, N, and b(y), as in Figure 2. It is evident that the sample variance of phi is much smaller than that of phi/phimax and the tetrachoric: the latter two coefficients have standard deviations 50 percent to 100 percent larger than phi, at least for the medium-sized coefficients, which are likely to be numerous in a real correlation matrix.

Conclusions

The simulation runs in Figures 2 through 4 are only a small fraction of the total examined. The others used different binarization points and (slightly) nonnormal population distributions. Those presented here are representative of the total, however. The sample averages and standard deviations varied as the distributions departed from normal, but the relations among the coefficients did not. It therefore appears that phi is adequate and perhaps even superior for use with binary data in factor analysis. With increasing sample size, the advantages of phi may disappear but not so as to preclude its use. Phi has two other advantages: it does not require a bivariate normal distribution as does the tetrachoric, and it is as easy to calculate in the usual factor analysis program as is the continuous product moment coefficient (the mechanics are the same).

This study has not addressed the situation where the data matrix contains a mixture of continuous and binary data.
It seems that the use of the product moment formula, which produces Pearson correlation coefficients for continuous data and phi coefficients at the other extreme, should turn out satisfactory coefficients in that case also, where the severe restrictions of two-valued variables have been relaxed.

LITERATURE CITED

1. CARROLL, J. B. The Nature of the Data, or How to Choose a Correlation Coefficient. Psychometrika, 26 (1961).
2. GUILFORD, J. P. Fundamental Statistics in Psychology and Education. 3rd ed. New York: McGraw-Hill Book Company.
3. KAISER, HENRY F. A Second-Generation Little Jiffy. Psychometrika, 35 (December 1970).
4. KIRK, DAVID B. On the Numerical Approximation of the Bivariate Normal (Tetrachoric) Correlation Coefficient. Psychometrika, 38 (June 1973).
5. NIE, NORMAN H., DALE H. BENT, and C. HADLAI HULL. SPSS: Statistical Package for the Social Sciences. New York: McGraw-Hill Book Company.
6. PEARSON, K. Mathematical Contribution to the Theory of Evolution, VII, On the Correlation of Characters Not Quantitatively Measurable. Phil. Trans. Roy. Soc. London, 1901, 195A.
7. RUMMEL, R. J. Applied Factor Analysis. Evanston: Northwestern University Press.
8. WALKER, HELEN M., and JOSEPH LEV. Statistical Inference. New York: Holt, Rinehart and Winston, 1953.
