Distribution Theory 1

Methodology Review: Applications of Distribution Theory in Studies of Population Validity and Cross Validity

James Algina
University of Florida

H. J. Keselman
University of Manitoba
ABSTRACT

Applications of distribution theory for the squared multiple correlation coefficient and the squared cross-validation coefficient are reviewed, and computer programs for these applications are made available. The applications include confidence intervals, hypothesis testing, and sample size selection.
Methodology Review: Applications of Distribution Theory in Studies of Population Validity and Cross Validity

Raju, Bilgic, Edwards, and Fleer (1997) presented a review of methods for estimating the population squared multiple correlation coefficient and the squared cross-validation correlation coefficient. While this review was quite thorough in terms of estimation, Raju et al. did not review distribution theory for these two coefficients. The purpose of this note is to extend the review by addressing applications of distribution theory under the assumption that scores on the dependent and independent variables are drawn from a multivariate normal distribution.

The Squared Multiple Correlation Coefficient

Fisher (1928) derived the distribution of the sample squared multiple correlation coefficient, R². Since that time work has been devoted to topics such as simplified derivations of the distribution (e.g., Moran, 1950), moments of the distribution (Wishart, 1931), approximate methods for evaluating the cumulative distribution function (CDF) of R² (Gurland & Milton, 1970) or of transformations of R² (Khatri, 1966; Lee, 1971), and exact methods for evaluating the CDF of R² (Ding & Bargmann, 1991; Ding, 1996). An important application of the CDF is the construction of confidence intervals for the population squared multiple correlation coefficient, ρ² (Ding, 1996). According to Steiger and Fouladi (1997), the lower limit of a two-sided 100(1 − α)% confidence interval for ρ² can be constructed by finding the value of ρ² such that the observed value of R² is the 100(1 − α/2)th percentile
of the distribution of R²; the upper limit can be constructed by finding the value of ρ² such that the observed value of R² is the 100(α/2)th percentile. For some sample sizes n, numbers of predictors k, values of R², and confidence levels, the 100(1 − α/2)th percentile, or both the 100(1 − α/2)th and the 100(α/2)th percentiles, do not exist. In the former case the lower limit is set equal to zero, and in the latter case both limits are set equal to zero.

Steiger and Fouladi (1992a) prepared a program, R2, that computes confidence intervals for ρ². While the program is very easy to use, instructors who use a commercially available program such as SAS or SPSS in their courses may want to use the program from their course to calculate the confidence interval. Accordingly, we have prepared SAS and SPSS programs that compute confidence intervals for ρ². These programs use Lee's (1971) noncentral F approximation to the distribution of R². According to Steiger and Fouladi (1992b), results from Lee's approach are accurate to the fourth decimal place. See Algina and Olejnik (2000) for a recent description of Lee's approach. The programs are called ci.smcc.sas and ci.smcc.sps and are available for download from [url to be added]. (All programs developed for this paper are available at this site.)

The use of a confidence interval for ρ² is very important, both as a way of assessing accuracy of estimation and as a way of acknowledging the bias of R² as an estimator of ρ². For example, suppose a study based on n = 30 participants with k = 10 predictors yields R² = .50. A 95% confidence interval for ρ²
is .00 to .58, indicating that the data provide relatively little information about the size of ρ² and that more of the interval is below .50 than above it. In more extreme examples, the interval can be wholly below the calculated value of R².

Confidence intervals can also be used to test hypotheses about ρ². For example, we may be interested in presenting evidence that ρ² exceeds some prescribed value ρ₀². To test

H₀: ρ² ≤ ρ₀²
H₁: ρ² > ρ₀²                (1)

we find the 100(1 − α)th percentile, say R²(1−α), of the distribution of R² in which ρ² = ρ₀². If the observed R² exceeds R²(1−α), we can reject H₀. When R² exceeds R²(1−α), the lower limit of a 100(1 − 2α)% confidence interval will be larger than ρ₀². Thus, by constructing a 100(1 − 2α)% confidence interval and determining whether ρ₀² is smaller than the lower limit, we have conducted an α-level test of the hypotheses in equation (1). If ρ₀² = 0, this confidence interval procedure is equivalent to the usual F test of the hypotheses

H₀: ρ² = 0
H₁: ρ² > 0                (2)

Using a confidence interval to test the hypotheses in (1) requires calculation of R²(1−α). SAS (percentile.mrsq.sas) and SPSS (percentile.mrsq.sps) programs are available to carry out this calculation.
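To make the inversion concrete, the sketch below implements it in Python rather than SAS or SPSS. The function names are ours, and instead of Lee's (1971) approximation the code uses an exact series representation of the distribution of R² that holds under multivariate normality with random predictors: a negative-binomial mixture of beta distributions. The confidence limits are then found by bisection on ρ², with nonexistent percentiles handled by setting the corresponding limit to zero, as described above.

```python
import math

def _betacf(a, b, x):
    """Continued-fraction evaluation for the regularized incomplete beta."""
    MAXIT, EPS, FPMIN = 300, 3e-12, 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    d = FPMIN if abs(d) < FPMIN else d
    d = 1.0 / d
    h = d
    for m in range(1, MAXIT + 1):
        m2 = 2 * m
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = FPMIN if abs(d) < FPMIN else d
        c = 1.0 + aa / c
        c = FPMIN if abs(c) < FPMIN else c
        d = 1.0 / d
        h *= d * c
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = FPMIN if abs(d) < FPMIN else d
        c = 1.0 + aa / c
        c = FPMIN if abs(c) < FPMIN else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < EPS:
            break
    return h

def betainc(a, b, x):
    """Regularized incomplete beta I_x(a, b), i.e., the Beta(a, b) CDF."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def r2_cdf(x, n, k, rho2):
    """P(R^2 <= x) for n observations, k random predictors, population rho2.

    Uses the standard mixture representation: R^2 | J = j is
    Beta(k/2 + j, (n - k - 1)/2), where J has negative-binomial weights
    w_j proportional to Gamma((n-1)/2 + j)/j! * rho2^j * (1-rho2)^((n-1)/2)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    half = (n - 1) / 2.0
    logw = half * math.log(1.0 - rho2)  # log of the j = 0 mixture weight
    total, sumw, j = 0.0, 0.0, 0
    while True:
        w = math.exp(logw)
        sumw += w
        total += w * betainc(k / 2.0 + j, (n - k - 1) / 2.0, x)
        if rho2 == 0.0 or sumw > 1.0 - 1e-12 or j > 100000:
            break
        # weight recurrence: w_{j+1}/w_j = ((n-1)/2 + j) * rho2 / (j + 1)
        logw += math.log((half + j) * rho2 / (j + 1.0))
        j += 1
    return min(1.0, total)

def r2_confidence_interval(r2_obs, n, k, alpha=0.05):
    """Two-sided 100(1 - alpha)% interval for rho2: the lower limit makes the
    observed R^2 the 100(1 - alpha/2)th percentile, the upper limit the
    100(alpha/2)th percentile; nonexistent limits are set to zero."""
    def solve(target):
        if r2_cdf(r2_obs, n, k, 0.0) < target:  # percentile does not exist
            return 0.0
        lo, hi = 0.0, 1.0 - 1e-9
        for _ in range(60):  # the CDF at r2_obs decreases as rho2 grows
            mid = 0.5 * (lo + hi)
            if r2_cdf(r2_obs, n, k, mid) > target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    return solve(1.0 - alpha / 2.0), solve(alpha / 2.0)
```

The same r2_cdf function can also be used to obtain the percentile needed for the hypothesis test in equation (1), by bisection on x rather than on ρ².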
An interesting application of the hypotheses in equation (1) is due to Browne (1975b). Consider the estimated multiple regression equation

Ŷ = a + b₁X₁ + ⋯ + bₖXₖ                (3)

based on a sample of size n. Browne defined the mean squared error of prediction (Stein, 1960) for the regression equation in (3) as

δ = E(Y − Ŷ)²

where E denotes the expectation operator and the expectation is taken over the distribution of the criterion and predictor variables with a and b₁, …, bₖ fixed. Browne also defined Δ = E(δ), the mean squared error of prediction averaged over all possible realizations of equation (3) for a sample of size n. Browne considered a situation in which m of the k predictors are, a priori, to be included in the regression equation and the researcher wants to know whether to include the additional k − m predictors. Rather than testing the null hypothesis that the k − m predictors do not increase the squared multiple correlation coefficient (H₀: ρ²ₖ = ρ²ₘ), Browne argued that it is more appropriate to test whether adding the k − m predictors reduces the expected mean squared error of prediction,

H₀: Δₖ ≥ Δₘ
H₁: Δₖ < Δₘ

and showed that these hypotheses are equivalent to
H₀: ρ²ₖ·ₘ ≤ (k − m)/(n − m)
H₁: ρ²ₖ·ₘ > (k − m)/(n − m)                (4)

where

ρ²ₖ·ₘ = (ρ²ₖ − ρ²ₘ)/(1 − ρ²ₘ)

is the squared partial multiple correlation coefficient and is estimated by

R²ₖ·ₘ = (R²ₖ − R²ₘ)/(1 − R²ₘ).

The distribution of R²ₖ·ₘ is the same as that of R² based on k − m predictors and a sample size of n − m. Therefore, the hypotheses in (4) can be tested by constructing a 100(1 − 2α)% confidence interval for ρ²ₖ·ₘ and checking whether or not the lower limit is greater than (k − m)/(n − m). If it is, then we can conclude that the additional predictors reduce the expected mean squared error of prediction. Our confidence interval programs (ci.smcc.sas and ci.smcc.sps) can construct confidence intervals for ρ²ₖ·ₘ. Browne used an approximation due to Gurland and Milton (1970) to test the hypotheses in (4).

As an alternative to the hypotheses in (1), we may want to test whether ρ² is smaller than some prescribed value ρ₀²:

H₀: ρ² ≥ ρ₀²
H₁: ρ² < ρ₀².
We construct a 100(1 − 2α)% confidence interval and determine whether ρ₀² is larger than the upper limit of the interval. If so, we can reject H₀.

Since the distribution of the squared Pearson product moment correlation coefficient is the distribution of R² for k = 1, the procedures described above can also be used to find confidence intervals for the population squared Pearson product moment correlation coefficient and to test hypotheses about it.

Another application of the distribution of R² is to select sample sizes to meet a target for accurate estimation of ρ². Algina and Olejnik (2000) used the distribution of R² to select sample sizes so that ρ² will be estimated with adequate accuracy when the adjusted squared multiple correlation coefficient R²ₐ is used as an estimator. Specifically, they found the sample size so that

Pr[max(0, ρ² − c) ≤ R²ₐ ≤ min(1, ρ² + c)] = p,

where c and p are defined by the researcher. For example, suppose a researcher has k = 8 predictors, believes that ρ² is .50, and wants to design a study so that

Pr[max(0, .50 − .10) ≤ R²ₐ ≤ min(1, .50 + .10)] = .95.

That is, the researcher wants a .95 probability that R²ₐ is in the interval (.40, .60). The required sample size is . SAS (sample.size.adjmrsq.sas) and SPSS (sample.size.adjmrsq.sps) programs that implement this approach are available. In addition, SAS (sample.size.mrsq.sas) and SPSS (sample.size.mrsq.sps) programs are available to find the sample size such that

Pr[max(0, ρ² − c) ≤ R² ≤ min(1, ρ² + c)] = p.
For example, suppose a researcher is planning a study with k = 8 predictors and believes ρ² is .50. The researcher wants to be 95% sure that the estimate of ρ² will be within .05 of ρ². If R²ₐ is used, the required sample size is 778; if R² is used, the required sample size is 780.

An alternative method for selecting sample size is to determine the sample size necessary to meet some target for power in testing the hypotheses in equation (2). Gatsonis and Sampson (1989) have discussed using the distribution of R² to conduct such analyses. As they have pointed out, when sampling is conducted so that the values observed for the predictors would vary from sample to sample, power analysis using Fisher's distribution for R² is more appropriate than power analysis assuming the values observed for the predictors would be fixed from sample to sample, as is done, for example, in Cohen's (1988) approach to power analysis in multiple regression. We have prepared SAS (power.mrsq.sas) and SPSS (power.mrsq.sps) programs to determine the sample size necessary to meet some target for power given ρ², α, and k. For example, suppose a researcher is planning a study with k = 8 predictors and α = .01. The researcher wants power to be .95 if ρ² is .50. Calculations show that a sample size of n = 45 is required. Note that, using the same values for k and ρ², the sample size to meet the .05 accuracy target would be 778 if R²ₐ were used. While it is impossible to equate how demanding a target for power and a target for accuracy are, it is fair to say that the accuracy and power targets used in these two examples are quite demanding. As these examples illustrate, it will often be the case that a larger sample size is required to achieve a target for accuracy than to achieve a target for power.
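Power under random predictors can be sketched with the same mixture representation of the distribution of R² used earlier (again a Python illustration of ours, not the SAS/SPSS programs named above): the critical value is the 100(1 − α)th percentile of the null distribution of R², which is Beta(k/2, (n − k − 1)/2), and power is the probability under Fisher's distribution that R² exceeds that critical value.

```python
import math

def _betacf(a, b, x):
    """Continued-fraction evaluation for the regularized incomplete beta."""
    MAXIT, EPS, FPMIN = 300, 3e-12, 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    d = FPMIN if abs(d) < FPMIN else d
    d = 1.0 / d
    h = d
    for m in range(1, MAXIT + 1):
        m2 = 2 * m
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = FPMIN if abs(d) < FPMIN else d
        c = 1.0 + aa / c
        c = FPMIN if abs(c) < FPMIN else c
        d = 1.0 / d
        h *= d * c
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = FPMIN if abs(d) < FPMIN else d
        c = 1.0 + aa / c
        c = FPMIN if abs(c) < FPMIN else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < EPS:
            break
    return h

def betainc(a, b, x):
    """Regularized incomplete beta I_x(a, b), i.e., the Beta(a, b) CDF."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def r2_cdf(x, n, k, rho2):
    """P(R^2 <= x): negative-binomial mixture of Beta(k/2 + j, (n-k-1)/2)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    half = (n - 1) / 2.0
    logw = half * math.log(1.0 - rho2)
    total, sumw, j = 0.0, 0.0, 0
    while True:
        w = math.exp(logw)
        sumw += w
        total += w * betainc(k / 2.0 + j, (n - k - 1) / 2.0, x)
        if rho2 == 0.0 or sumw > 1.0 - 1e-12 or j > 100000:
            break
        logw += math.log((half + j) * rho2 / (j + 1.0))
        j += 1
    return min(1.0, total)

def null_critical_value(n, k, alpha):
    """100(1-alpha)th percentile of R^2 under rho2 = 0, by bisection on
    the Beta(k/2, (n-k-1)/2) CDF."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if betainc(k / 2.0, (n - k - 1) / 2.0, mid) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def power(n, k, rho2, alpha):
    """P(reject H0: rho2 = 0) when the predictors are random."""
    crit = null_critical_value(n, k, alpha)
    return 1.0 - r2_cdf(crit, n, k, rho2)
```

A sample-size search is then a matter of increasing n until power(n, k, ρ², α) reaches the target.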
The Squared Cross-Validation Correlation Coefficient

The squared cross-validation correlation coefficient ρc² is defined as the square of the population Pearson product moment correlation between Y and Ŷ. The coefficient is a random variable, defined over the samples of size n that could have been drawn. Fowler (1986) stated, without proof, that the distribution of R² can be used to find a 100(1 − α)% confidence interval for ρc². First a 100(1 − α)% confidence interval for ρ² is constructed. Then the lower limit and the upper limit of the confidence interval are substituted in Browne's (1975b) expression for the first-order approximation to the expected value of ρc²:

E(ρc²) ≈ [(n − k − 3)ρ⁴ + ρ²] / [(n − 2k − 2)ρ² + k].

Because this procedure starts with a 100(1 − α)% confidence interval for ρ² and because the approximation is a monotone function of ρ², it is clear that Fowler's procedure provides a 100(1 − α)% confidence interval for the approximate expected value of ρc². However, it is less clear that the procedure provides a 100(1 − α)% confidence interval for ρc² itself. Furthermore, although Fowler reported results that seem to substantiate the accuracy of the interval, Fowler did not describe how these results were obtained. In addition, Fowler used an approximate procedure to obtain the 100(1 − α)% confidence interval for ρ² and reported on only a few cases. Given the limitations of this work, we conducted a more extensive simulation.
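Fowler's mapping step is simple to state in code. The sketch below (ours; it takes the ρ² confidence limits as inputs rather than computing them) pushes the limits through the widely cited first-order form of Browne's (1975b) approximation, E(ρc²) ≈ [(n − k − 3)ρ⁴ + ρ²] / [(n − 2k − 2)ρ² + k]. Because the approximation is increasing in ρ² over the usual range, the order of the endpoints is preserved.

```python
def browne_expected_cross_validity(rho2, n, k):
    """First-order approximation to E(rho_c^2) attributed to Browne (1975b):
    ((n - k - 3) rho^4 + rho^2) / ((n - 2k - 2) rho^2 + k)."""
    num = (n - k - 3) * rho2 ** 2 + rho2
    den = (n - 2 * k - 2) * rho2 + k
    return num / den

def fowler_interval(rho2_lower, rho2_upper, n, k):
    """Map a confidence interval for rho^2 into one for the approximate
    expected value of rho_c^2, as in Fowler (1986)."""
    return (browne_expected_cross_validity(rho2_lower, n, k),
            browne_expected_cross_validity(rho2_upper, n, k))
```

For example, with n = 100 and k = 5, a ρ² interval of (.30, .60) maps to a shorter interval for the approximate E(ρc²), reflecting the shrinkage of the cross-validity coefficient below ρ².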
The design of our simulation was a 4 (n) × 5 (k) × 8 (ρ²) × 2 (nominal confidence level) crossed factorial design. The levels of the factors were n = 30 to 120 in steps of 30, k = 2 to 10 in steps of 2, ρ² = .05 to .75 in steps of .10, and confidence level = .95 and .99. All conditions were replicated 5,000 times. For each condition, the percentage of the 5,000 intervals that contained E(ρc²), computed from Browne's (1975b) second-order approximation (equation 5), was determined; equation (5) is a more accurate approximation to the expected value of ρc² than is the first-order approximation (Browne, 1975b). In addition, the percentage of the 5,000 intervals that contained ρc² was also determined. We also determined the average value of ρc² over the 5,000 replications, in order to determine how well equation (5) approximated the expected value of ρc². The approximation was very accurate. Thus, determining the adequacy of the coverage probability for E(ρc²) tells us how adequate the coverage probability is for the expected value of ρc².

Table 1 reports the results for the most extreme combination of sample size and number of predictors (n = 30 and k = 10). The results indicate excellent coverage probability for the interval for the expected value of ρc². The coverage probabilities are also good for ρc², with a slight tendency to decline as ρ² gets large. This tendency was reduced as n increased or k decreased.

It is well known that ρc² must be less than or equal to ρ². Therefore a reasonable goal in planning a study to estimate a prediction equation is to have a
suitably high probability that ρc² is sufficiently close to ρ². Algina and Keselman (2000) used simulated data to estimate the required sample size for a number of combinations of ρ², k, and c, where c is the maximum tolerable discrepancy between ρ² and ρc². However, Park and Dudycha (1974) had earlier shown that

Pr(ρc² ≥ γ) = 1 − Pr[F(1, k − 1, λ) ≤ (k − 1)γ / (1 − γ)],

where F(1, k − 1, λ) is a noncentral F distribution with 1 and k − 1 degrees of freedom and noncentrality parameter¹

λ = ρ²(n − k − 1) / (1 − ρ²).

This result can be used to determine analytically the sample size required for ρc² to be sufficiently close to ρ². Because the researcher's goal is to have a suitably high probability that ρ² − ρc² ≤ c, he or she can select the sample size so that

Pr(ρ² − ρc² ≤ c) = 1 − Pr[F(1, k − 1, λ) ≤ (k − 1)(ρ² − c) / (1 − ρ² + c)] ≥ p.

Park and Dudycha published tables that provide the required sample size for a limited number of combinations of ρ², k, and c. We have prepared SAS (sample.size..sas) and SPSS (sample.size..sps) programs to compute the required sample size. For example, suppose a researcher believes that ρ² with k = 6 predictors is .30 and wants a .95 probability that the discrepancy between ρ² and ρc² is not larger than .05. That is, the researcher wants to find n such that
Pr(.30 − ρc² ≤ .05) = 1 − Pr[F(1, 6 − 1, λ) ≤ (6 − 1)(.30 − .05) / (1 − .30 + .05)] ≥ .95,

where λ = .30(n − 6 − 1)/(1 − .30). Calculations yield n = 154.

Conclusion

Depending on the purpose of the study, estimation of ρ² or ρc² will be of interest. Raju, Bilgic, Edwards, and Fleer (1997) presented an excellent review of point estimation methods for these parameters, but did not focus on interval estimation. We have reviewed applications of the distribution theory for R² and ρc², including confidence intervals, power, and sample size determination. Thus our review complements the review by Raju, Bilgic, Edwards, and Fleer.
References

Algina, J., & Keselman, H. J. (2000). Cross-validation sample sizes. Applied Psychological Measurement, 24, 173-180.

Algina, J., & Olejnik, S. (2000). Determining sample size for accurate estimation of the squared multiple correlation coefficient. Multivariate Behavioral Research, 35, 119-136.

Browne, M. W. (1975a). Predictive validity of a linear regression equation. British Journal of Mathematical and Statistical Psychology, 28, 79-87.

Browne, M. W. (1975b). A comparison of single sample and cross-validation methods of estimating the mean squared error of prediction in multiple linear regression. British Journal of Mathematical and Statistical Psychology, 28, 112-120.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Ding, C. G. (1996). On the computation of the distribution of the square of the sample multiple correlation coefficient. Computational Statistics and Data Analysis, 22, 345-350.

Ding, C. G., & Bargmann, R. E. (1991). Algorithm AS 260: Evaluation of the distribution of the square of the sample multiple correlation coefficient. Applied Statistics, 40, 195-198.

Fisher, R. A. (1928). The general sampling distribution of the multiple correlation coefficient. Proceedings of the Royal Society of London, Series A, 121, 654-673.
Fowler, R. L. (1986). Confidence intervals for the cross-validated multiple correlation in predictive regression models. Journal of Applied Psychology, 71, 318-322.

Gatsonis, C., & Sampson, A. R. (1989). Multiple correlation: Exact power and sample size calculations. Psychological Bulletin, 106, 516-524.

Gurland, J., & Milton, R. (1970). Further considerations of the distribution of the multiple correlation coefficient. Journal of the Royal Statistical Society, Series B (Methodological), 32, 381-394.

Khatri, C. G. (1966). A note on a large sample distribution of a transformed multiple correlation coefficient. Annals of the Institute of Statistical Mathematics, 18, 375-380.

Lee, Y. S. (1971). Some results on the sampling distribution of the multiple correlation coefficient. Journal of the Royal Statistical Society, Series B (Methodological), 33, 117-130.

Moran, P. A. P. (1950). The distribution of the multiple correlation coefficient. Proceedings of the Cambridge Philosophical Society, 46, 521-522.

Park, C. N., & Dudycha, A. L. (1974). A cross-validation approach to sample size determination for regression models. Journal of the American Statistical Association, 69, 214-218.

Raju, N. S., Bilgic, R., Edwards, J. E., & Fleer, P. F. (1997). Methodology review: Estimation of population validity and cross-validity, and the use of equal weights in prediction. Applied Psychological Measurement, 21, 291-305.
Steiger, J. H., & Fouladi, R. T. (1992a). R2: A computer program for interval estimation, power calculations, sample size estimation, and hypothesis testing in multiple regression. Behavior Research Methods, Instruments, & Computers, 24, 581-582.

Steiger, J. H., & Fouladi, R. T. (1992b). R2 user's guide: Version 1.1 [On-line]. http://www.interchg.ubc.ca/steiger/homepage.htm

Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Lawrence Erlbaum Associates.

Stein, C. (1960). Multiple regression. In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 424-443). Stanford: Stanford University Press.

Wishart, J. (1931). The mean and second moment coefficients of the multiple correlation coefficient, in samples from a normal population. Biometrika, 22, 353-376.
Footnote

¹In equation (4.16), Park and Dudycha (1974) report a noncentrality parameter for the F distribution that differs from λ. However, their derivation indicates that the noncentrality parameter is λ. In addition, they prepared their tables using the noncentral t distribution with noncentrality parameter √λ, which also suggests that the noncentrality parameter for the F distribution is λ. Finally, we replicated their tables using the F distribution with λ as the noncentrality parameter.
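The footnote's reasoning rests on the identity t(ν, δ)² = F(1, ν, δ²): a noncentral t variable with noncentrality √λ squares to a noncentral F(1, ν, λ) variable. A Monte Carlo sketch (ours, with illustrative constants ν = 12 and δ = 2) checks that squared noncentral t draws and noncentral F draws with λ = δ² share the same mean, ν(1 + λ)/(ν − 2).

```python
import math
import random

def noncentral_t_draw(rng, nu, delta):
    """One draw of noncentral t(nu, delta): (Z + delta) / sqrt(V / nu),
    with Z standard normal and V chi-square on nu degrees of freedom."""
    z = rng.gauss(0.0, 1.0)
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return (z + delta) / math.sqrt(v / nu)

def noncentral_f_draw(rng, nu, lam):
    """One draw of noncentral F(1, nu, lam): (Z + sqrt(lam))^2 / (V / nu)."""
    z = rng.gauss(0.0, 1.0)
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return (z + math.sqrt(lam)) ** 2 / (v / nu)

rng = random.Random(12345)
nu, delta = 12, 2.0
lam = delta ** 2          # F noncentrality implied by the t noncentrality
t_sq = [noncentral_t_draw(rng, nu, delta) ** 2 for _ in range(30000)]
f = [noncentral_f_draw(rng, nu, lam) for _ in range(30000)]
# theoretical mean of F(1, nu, lam): nu * (1 + lam) / (nu - 2)
expected = nu * (1 + lam) / (nu - 2)
```

With these constants the common theoretical mean is 12(1 + 4)/10 = 6, and both simulated means fall close to it.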
Table 1

Estimated Confidence Coefficients for 95% and 99% Confidence Intervals on E(ρc²) and ρc²: n = 30 and k = 10

                 Nominal Confidence Level
              .95                    .99
ρ²       E(ρc²)    ρc²         E(ρc²)    ρc²
.05      .9454     .9486       .9918     .996
.15      .9496     .95         .99       .99
.25      .9448     .9444       .9918     .9914
.35      .9458     .948        .9896     .988
.45      .954      .946        .9876     .989
.55      .9496     .94         .99       .986
.65      .9456     .9336       .998      .9866
.75      .954      .9358       .9876     .9838

Note. ρ² is the population squared multiple correlation coefficient and k is the number of predictors.