Journal of Educational and Behavioral Statistics. DOI: 10.3102/10769986008002093. Version of Record: January 1, 1983.
Journal of Educational Statistics
Summer 1983, Volume 8, Number 2, pp. 93-101

THEORY OF ESTIMATION AND TESTING OF EFFECT SIZES: USE IN META-ANALYSIS

HELENA CHMURA KRAEMER
Stanford University

Key words: Meta-analysis, Research Synthesis, Effect Sizes

ABSTRACT. Approximations to the distribution of a common form of effect size are presented. Single sample tests, confidence interval formulation, tests of homogeneity, and pooling procedures are based on these approximations. Caveats are presented concerning statistical procedures as applied to sample effect sizes commonly used in meta-analysis.

When policy decisions are based on research, such decisions are generally not based on one research study but on the consensus of many, nor on statistical significance alone but on evaluation of practical significance (i.e., cost-effectiveness) as well. In recent years there has been growing emphasis on systematic syntheses of the results of research studies, meta-analysis (Glass, McGaw, & Smith, 1981), based on the use of quantitative measures indicating practical as well as statistical significance: effect sizes. The application of such methods has, in many ways, preceded the development of the sound mathematical theory necessary to implement it. Consequently, there is much controversy and doubt about the validity of results based on meta-analysis.

The focus here is on the most frequently used sample effect size:

$$d = (\bar{x}_E - \bar{x}_C)/S,$$

where $\bar{x}_E$ is the mean response of $n_E$ subjects in an experimental group, $\bar{x}_C$ that of $n_C$ subjects in a separate control group, and $S^2$ is a sample variance. Generally, it is assumed that

$$x_{Ei} \sim \mathcal{N}(\mu_E, \sigma^2), \quad i = 1, 2, \ldots, n_E; \qquad x_{Ci} \sim \mathcal{N}(\mu_C, \sigma^2), \quad i = 1, 2, \ldots, n_C.$$

Further, $S^2$ is an estimate of $\sigma^2$, independent of $\bar{x}_E$ and $\bar{x}_C$, with $\nu S^2/\sigma^2 \sim \chi^2_\nu$. Under these assumptions (Winer, 1971),

$$[Np(1-p)]^{1/2}\, d \sim t_\nu\!\left([Np(1-p)]^{1/2}\,\delta\right),$$
where $N = n_E + n_C$ is the total sample size; $p = n_E/N$ reflects balance; $\delta = (\mu_E - \mu_C)/\sigma$ is the population effect size; and $t_\nu(\lambda)$ represents a noncentral $t$-distribution with $\nu$ degrees of freedom and noncentrality parameter $\lambda$. If $S^2$ is the pooled within-group variance, $\nu = N - 2$. If $S^2$ is based on the control group alone, $\nu = n_C - 1$.

The mathematical form of the noncentral $t$ distribution is known, and has been tabled (Resnikoff & Lieberman, 1957). Use of either the exact form or tables as a basis of statistical analysis of effect sizes is extremely difficult. However, when $n_E$ and $n_C$ both exceed about 10, accurate approximations to the noncentral $t$ distribution are available and provide an accessible approach to applications. When $N$ is small, or when group sizes are disparate, or when $S^2$ is injudiciously chosen, the distribution theory is highly sensitive to departures from the assumptions (Scheffe, 1959). Interpretation of $\delta$ also becomes problematic (Kraemer & Andrews, 1982). Finally, the approximation procedures are of doubtful accuracy. For all these reasons, attention is here restricted only to the situation in which $N > 20$, when $.4 \le p \le .6$, and when $S^2$ is the pooled within-group variance ($\nu = N - 2$).

To test the null hypothesis $H_0$: $\delta = 0$ versus $\delta > 0$, one would reject $H_0$ at the $\alpha$-level of significance if

$$[Np(1-p)]^{1/2}\, d > t_{\nu,\alpha},$$

where $t_{\nu,\alpha}$ is the upper $\alpha$-level critical value of the $t$ distribution with $\nu$ degrees of freedom. Since computation of effect size is usually done when one has an a priori reason to believe $\delta > 0$, this particular test is minimally important. More useful are procedures, in any single study, to:

1. test the null hypothesis $H_0$: $\delta \le \delta_0$ versus $A$: $\delta > \delta_0$, and compute the power of this test;
2. compute a confidence interval for $\delta$;

or, in meta-analysis, to:

3. test the homogeneity of $\delta_1, \delta_2, \ldots, \delta_m$;
4.
obtain a pooled estimate and compute a confidence interval for $\delta$ on the basis of a set of sample effect sizes $d_1, d_2, \ldots, d_m$; and
5. examine statistically those factors producing heterogeneity of effect sizes.

Approximation to the Distribution of Sample Effect Size

One approximation with little mathematical justification (cf. Hedges, 1982a) is:

$$[Np(1-p)]^{1/2}(d - \delta) \,\dot\sim\, \mathcal{N}(0, 1).$$
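The noncentral-$t$ distribution of $d$, and the bias it implies relative to the naive normal approximation, can be illustrated numerically. A minimal sketch, assuming a hypothetical balanced design ($n_E = n_C = 10$, $\delta = 0.5$) and using scipy's noncentral $t$:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

# Hypothetical balanced design: n_E = n_C = 10, so N = 20, p = .5, nu = N - 2
n_E = n_C = 10
N = n_E + n_C
p = n_E / N
nu = N - 2
delta = 0.5                              # hypothetical population effect size
scale = np.sqrt(N * p * (1 - p))         # [Np(1-p)]^{1/2}

# Exact theory: [Np(1-p)]^{1/2} d ~ noncentral t(nu, lambda), lambda = scale * delta
lam = scale * delta
E_d = stats.nct.mean(nu, lam) / scale

# Closed form E(d) = delta * (nu/2)^{1/2} * Gamma((nu-1)/2) / Gamma(nu/2)
E_d_formula = delta * np.sqrt(nu / 2) * np.exp(gammaln((nu - 1) / 2) - gammaln(nu / 2))

print(E_d, E_d_formula)                  # ~0.522 > 0.5: d overestimates delta

# Upper-tail probability Pr{d >= 1.0}: exact vs. the naive normal approximation
exact = stats.nct.sf(scale * 1.0, nu, lam)
naive = stats.norm.sf(scale * (1.0 - delta))
print(exact, naive)                      # the normal approximation understates the tail
```

Even at this moderate sample size, the bias factor is about 4 percent, and the naive approximation is visibly too thin in the upper tail.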
Yet this approximation underlies many of the statistical procedures commonly used in meta-analysis (e.g., Hsu, 1980). It should be noted that:

1. $E(d) = \delta\,(\nu/2)^{1/2}\,\Gamma((\nu-1)/2)/\Gamma(\nu/2)$. Hence $d$ always overestimates $\delta$ (Hedges, 1981).

2. For large sample sizes,

$$\operatorname{var}(d) \approx \frac{1 + \delta^2 p(1-p)/2}{Np(1-p)}$$

(Hedges, 1981). Thus, not only is the variance consistently underestimated, but the variance is not independent of $\delta$. For this reason, applications of test procedures to sample effect sizes such as $t$ tests, analysis of variance, or linear regression, which assume homoscedasticity, are of questionable validity.

3. The distribution of $d$ is both skewed and heavy-tailed and is here approximated by the normal distribution, which is neither.

Better approximations are the Johnson-Welch procedure (1940) and the Kraemer-Paik procedure (1979). The Johnson-Welch procedure describes a normal approximation to the noncentral $t$ distribution which, applied to effect size, would yield:

$$\Pr\{d \ge D\} \approx \Pr\{z \ge [Np(1-p)]^{1/2}(D - \delta)/[1 + D^2/2f]^{1/2}\},$$

where $f = \nu/Np(1-p)$ ($f \approx 4$). This procedure justifies tests such as those proposed by Hedges (1982a, 1982b, 1982c).

More accurate for small noncentrality parameters, and more useful in this context, is the Kraemer-Paik procedure (1979). Applied to effect sizes, this procedure indicates that if

$$r = d/(d^2 + f)^{1/2} \qquad \text{and} \qquad \rho = \delta/(\delta^2 + f)^{1/2},$$

then

$$u(r, \rho) = (r - \rho)/(1 - r\rho)$$

is approximately distributed according to the null distribution of the product moment correlation coefficient; that is,

$$\nu^{1/2}\, u/(1 - u^2)^{1/2} \sim t_\nu.$$

Thus percentile points of $u(r, \rho)$, say $C_{\nu,P}$, where

$$\Pr\{u(r, \rho) > C_{\nu,P}\} = P,$$
are tabled (Fisher & Yates, 1957). Alternatively, since

$$C_{\nu,P} = t_{\nu,P}/(t_{\nu,P}^2 + \nu)^{1/2},$$

percentile points of $u(r, \rho)$ may be computed from tables of the $t$ distribution. Furthermore, since $u(r, \rho)$ is distributed as is the product moment correlation coefficient, Fisher's $z$ transformation, that is,

$$z(u) = \tfrac{1}{2}\ln\!\left(\frac{1+u}{1-u}\right) = \tanh^{-1} u,$$

is both a variance-stabilizing and a normalizing transformation. Since then $z(u) = z(r) - z(\rho)$,

$$z(r) \,\dot\sim\, \mathcal{N}\!\left(z(\rho), (\nu - 1)^{-1}\right).$$

Computing percentile points is somewhat less accurate and more tedious using this transformation, rather than using $C_{\nu,P}$ directly, but there are many other applications in which having a variance independent of the mean is crucial.

Single Sample Tests

Consider the null hypothesis $H_0$: $\delta \le \delta_0$ versus $A$: $\delta > \delta_0$. One would reject the null hypothesis at the $\alpha$ level of significance if

$$u(r, \rho_0) \ge C_{\nu,\alpha},$$

where $\rho_0 = \delta_0/(\delta_0^2 + f)^{1/2}$; that is, if

$$r \ge (C_{\nu,\alpha} + \rho_0)/(1 + C_{\nu,\alpha}\rho_0),$$

or in terms of $d$:

$$d \ge f^{1/2}(C_{\nu,\alpha} + \rho_0)/[(1 - C_{\nu,\alpha}^2)(1 - \rho_0^2)]^{1/2}.$$

The power of this test at $\delta_1 > \delta_0$ can easily be estimated, for

$$\begin{aligned}
\text{Power}(\delta_1) &= \Pr\{r > (C_{\nu,\alpha} + \rho_0)/(1 + C_{\nu,\alpha}\rho_0) \mid \rho = \rho_1\} \\
&= \Pr\{u(r, \rho_1) > u[(C_{\nu,\alpha} + \rho_0)/(1 + C_{\nu,\alpha}\rho_0),\, \rho_1] \mid \rho = \rho_1\} \\
&= \Pr\{u(r, \rho_1) > (C_{\nu,\alpha} - \Delta)/(1 - C_{\nu,\alpha}\Delta) \mid \rho = \rho_1\},
\end{aligned}$$

where $\Delta = (\rho_1 - \rho_0)/(1 - \rho_1\rho_0)$.
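The rejection rule above is short to implement. A minimal sketch (the design values $N = 20$ and $\delta_0$ are hypothetical; note that at $\delta_0 = 0$ the rule reduces exactly to the ordinary one-sided $t$ test on $d$):

```python
import numpy as np
from scipy import stats

def d_critical(delta0, N, p=0.5, alpha=0.05):
    """Smallest d rejecting H0: delta <= delta0, via the Kraemer-Paik rule."""
    nu = N - 2                                  # pooled within-group variance
    f = nu / (N * p * (1 - p))
    t = stats.t.isf(alpha, nu)                  # upper alpha point of t_nu
    C = t / np.sqrt(t**2 + nu)                  # C_{nu,alpha}
    rho0 = delta0 / np.sqrt(delta0**2 + f)
    return np.sqrt(f) * (C + rho0) / np.sqrt((1 - C**2) * (1 - rho0**2))

# At delta0 = 0 the rule coincides with the usual t test critical value for d:
N, p, alpha = 20, 0.5, 0.05
usual = stats.t.isf(alpha, N - 2) / np.sqrt(N * p * (1 - p))
print(d_critical(0.0, N), usual)                # both ~0.776
```

The agreement at $\delta_0 = 0$ follows algebraically from $C_{\nu,\alpha} = t_{\nu,\alpha}/(t_{\nu,\alpha}^2 + \nu)^{1/2}$, since then $f^{1/2}C/(1-C^2)^{1/2} = t_{\nu,\alpha}/[Np(1-p)]^{1/2}$.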
Thus, since

$$\nu^{1/2}\, u(r, \rho_1)/[1 - u^2(r, \rho_1)]^{1/2} \sim t_\nu,$$

$$\text{Power}(\delta_1) = \Pr\{t_\nu > \nu^{1/2}(C_{\nu,\alpha} - \Delta)/[(1 - C_{\nu,\alpha}^2)(1 - \Delta^2)]^{1/2}\}.$$

Table I presents values of $\Delta$ as a function of $\nu$ and $P$. One notes that to detect a separation between effect sizes yielding $\Delta = .4$, one needs approximately a sample size of 60 for 95 percent power, about 50 for 90 percent power, about 35 for 80 percent power, and so forth. For a sample size of 20 ($\nu = 18$) one has less than an even chance of detecting such a separation.

The import of this observation is clear only if one realizes how the metric $\Delta$ reflects separation between effect sizes. In Table II are presented $\Delta$ for paired values of $\delta_0 < \delta_1$. A separation between effect size $\delta_0 = 0$ and $\delta_1 = .8$, that is, between null effect size and one quite large as compared to those

TABLE I
Table of $\Delta$ to Achieve a Power of $P$ with $\nu$ Degrees of Freedom ($N = \nu + 2$)

P \ ν    18     20     25     30     40     60    100
50%     .378   .360   .323   .296   .258   .211   .150
60%     .429   .409   .368   .338   .295   .242   .172
70%     .481   .459   .415   .381   .333   .275   .196
80%     .537   .514   .467   .430   .378   .313   .244
90%     .609   .584   .534   .495   .436   .363   .262
95%     .662   .637   .585   .544   .483   .404   .292

TABLE II
$\Delta$ as a Function of $\delta_0$, $\delta_1$ ($f = 4$)

δ0 \ δ1   0     .2     .4     .6     .8    1.0    1.2    1.4    1.6    1.8    2.0
0         0   .100   .196   .287   .371   .447   .514   .573   .625   .669   .707
.2              0    .099   .193   .282   .364   .437   .503   .560   .610   .654
.4                     0    .097   .189   .275   .354   .425   .488   .544   .593
.6                            0    .094   .183   .267   .343   .411   .472   .527
.8                                   0    .091   .177   .257   .330   .396   .455
1.0                                         0    .087   .170   .246   .316   .380
1.2                                                0    .084   .162   .236   .303
1.4                                                       0    .080   .155   .225
1.6                                                              0    .076   .148
1.8                                                                     0    .072
2.0                                                                            0
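The entries of Tables I and II follow directly from the formulas above: inverting the power equation on the $u$ scale gives the required separation as $\Delta = (C_{\nu,\alpha} - C_{\nu,P})/(1 - C_{\nu,\alpha}C_{\nu,P})$. A minimal sketch regenerating a few table entries:

```python
import numpy as np
from scipy import stats

def C(nu, P):
    """Upper-P percentile point C_{nu,P} = t_{nu,P}/(t_{nu,P}^2 + nu)^{1/2}."""
    t = stats.t.isf(P, nu)
    return t / np.sqrt(t**2 + nu)

def Delta_for_power(nu, power, alpha=0.05):
    """Separation Delta needed for the stated power (reproduces Table I)."""
    c_a, c_p = C(nu, alpha), C(nu, power)       # c_p < 0 for power > 50%
    return (c_a - c_p) / (1 - c_a * c_p)

def Delta(delta0, delta1, f=4.0):
    """Delta = (rho1 - rho0)/(1 - rho1*rho0) (reproduces Table II)."""
    rho0 = delta0 / np.sqrt(delta0**2 + f)
    rho1 = delta1 / np.sqrt(delta1**2 + f)
    return (rho1 - rho0) / (1 - rho1 * rho0)

print(round(Delta_for_power(18, 0.50), 3))      # 0.378 (Table I, nu = 18, 50%)
print(round(Delta_for_power(18, 0.95), 3))      # 0.662 (Table I, nu = 18, 95%)
print(round(Delta(0.0, 0.8), 3))                # 0.371 (Table II, delta0 = 0, delta1 = .8)
```

At 50 percent power $C_{\nu,P} = 0$, so the required $\Delta$ is simply $C_{\nu,\alpha}$, which is why the 50% row of Table I matches the critical values $C_{\nu,.05}$.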
reported in the literature, yields $\Delta = .371$. Consequently, with a sample size of 20, one would have less than an even chance of discriminating no effect from a large effect. Once again, from yet another viewpoint, there is serious difficulty in using effect sizes when sample sizes are small.

Confidence Interval for $\delta$

Since $u(r, \rho)$ has a distribution independent of $\rho$, one- or two-sided confidence intervals for $\rho$, and hence for

$$\delta = f^{1/2}\rho/(1 - \rho^2)^{1/2},$$

are readily obtained. For example, where $r = d/(d^2 + f)^{1/2}$,

$$\Pr\{u(r, \rho) < C_{\nu,\alpha}\} \approx 1 - \alpha.$$

Thus

$$\Pr\{\rho > (r - C)/(1 - rC)\} \approx 1 - \alpha, \qquad C = C_{\nu,\alpha},$$

or

$$\Pr\{\delta \ge f^{1/2}(r - C)/[(1 - r^2)(1 - C^2)]^{1/2}\} \approx 1 - \alpha.$$

It can readily be verified that for small $N$, confidence intervals for $\delta$ will be very wide. For example, if one observed $d = 1.0$ ($r = .45$) for $N = 20$ ($\nu = 18$, $C_{.05} = .38$), the one-tailed 95 percent confidence interval for $\delta$ will be

$$\delta > \frac{2(.45 - .38)}{(.80 \times .86)^{1/2}} = .17.$$

A Test of Homogeneity/Pooling Effect Sizes

If $d_1, d_2, \ldots, d_m$ are independent sample effect sizes (i.e., based on different samples), one might wish to test whether they all estimate the same effect size, that is, $H_0$: $\delta_1 = \delta_2 = \cdots = \delta_m$ (cf. Hedges, 1982a, 1982b). Because it is assumed that $f \approx 4$ for all included effect sizes, this null hypothesis is equivalent to $H_0$: $\rho_1 = \rho_2 = \cdots = \rho_m$, where $\rho_i = \delta_i/(\delta_i^2 + 4)^{1/2}$. This, then, becomes a test of homogeneity of correlation coefficients. One estimates the common $\rho$, under the null hypothesis, by $\hat\rho$, where

$$z(\hat\rho) = \frac{\sum_i (\nu_i - 1)\, z(r_i)}{\sum_i (\nu_i - 1)}.$$

The test statistic is

$$X^2 = \sum_i (\nu_i - 1)\left(z(r_i) - z(\hat\rho)\right)^2,$$
which, under $H_0$, has approximately a $\chi^2$ distribution with $(m - 1)$ degrees of freedom (Kraemer, 1975, 1979). Furthermore, the pooled estimate of $\delta$ is then $\hat\delta$, where

$$\hat\delta = 2\hat\rho/(1 - \hat\rho^2)^{1/2}.$$

Then

$$z(\hat\rho) \,\dot\sim\, \mathcal{N}\!\left(z(\rho),\ 1/\textstyle\sum_i(\nu_i - 1)\right).$$

On this basis, tests or confidence intervals for $\rho$ are readily formulated.

Note that $\hat\delta$ is not a weighted average of $d_1, d_2, \ldots, d_m$ and that $\hat\delta$ itself is not normally distributed. These are important points to consider in meta-analysis, because pooling effect sizes is usually implemented by using a weighted average of $d_1, \ldots, d_m$, say

$$d_\omega = \sum_i \omega_i d_i \Big/ \sum_i \omega_i,$$

where the $\omega_i$ are positive weights based on sample sizes. This statistic is asymptotically normally distributed if the sample size underlying each $d_i$ is large or if $m$, the number of studies, is large, but the number of subjects per study and the number of different research studies are rarely large enough to warrant using asymptotic theory. If all the studies are small, even asymptotically as the number of studies increases, $d_\omega$ is biased. However, as Hedges points out (1981), one might replace $d_i$ by an unbiased estimate of $\delta$. Even then, however, the variance of $d_i$ is not independent of the unknown $\delta$. Obtaining valid confidence intervals or tests based on $d_\omega$ under these circumstances is indeed a problem, particularly when sample sizes are small.

General Applications

There are many other statistical questions related to the use of sample effect sizes in meta-analysis. One might compile effect sizes for each of $g$ interventions and wish to compare these using $t$ tests or analysis of variance. One might examine characteristics of the studies yielding different effect sizes (size of class, intensity or duration of intervention, length of follow-up, etc.) and wish to assess the influence of such factors on effect size using multiple linear regression. All such evaluations are questionable when applied to sample effect sizes directly (cf.
Hedges, 1982c). The above statistical considerations, however, suggest a strategy to implement such procedures, namely:

(1) Studies with group size less than 10 or which are seriously unbalanced ($p < .4$ or $p > .6$) should be set aside. Such studies may be valid for purposes of testing, but estimation of effect sizes from such studies is, for reasons detailed above, problematic.

(2) Only one effect size per study can be used, to ensure independence.
(3) All remaining effect sizes are transformed as follows:

$$r_i = d_i/(d_i^2 + 4)^{1/2}, \qquad z_i = z(r_i).$$

Analytic procedures are then applied, not to $d_i$, but to $z_i$.

The effect of this transformation is to attenuate size. When $d_i$ is small, there is little change; that is, $z_i \approx d_i$. When $d_i = .80$, for example, $z_i = .78$. When $d_i = 2.0$, $z_i = 1.8$. When $d_i = 10.0$, $z_i = 4.63$. Sample effect size $d_i$ has a skew distribution with heavy tails; $z_i$ has approximately a normal distribution with mean $z(\rho)$ ($\rho = \delta/(\delta^2 + 4)^{1/2}$) and variance equal to $(\nu - 1)^{-1}$.

Hedges and Olkin (1981) suggest a somewhat different variance-stabilizing transformation, one essentially based on the Johnson-Welch approximation to the distribution of $d$. They suggest (in the balanced case) using

$$z_H(d) = \sqrt{2}\,\sinh^{-1}\!\left(d/2\sqrt{2}\right),$$

with

$$z_H(d) \,\dot\sim\, \mathcal{N}(z_H(\delta), 1/N).$$

Here we suggest using

$$z_K(d) = \tanh^{-1}\!\left(d/\sqrt{d^2 + 4}\right),$$

with

$$z_K(d) \,\dot\sim\, \mathcal{N}(z_K(\delta), 1/(N - 3)).$$

In Table III are presented values of $z_H$ and $z_K$ for effect sizes $d = 0$ to 2.0. For the typical range of effect sizes, only for relatively small sample size will results based on the two approaches differ. In any one study, because effect sizes are generally small, use of either transformation rather than $d$ will make little difference. However, the relatively

TABLE III
Comparison of Two Variance-Stabilizing Transformations for $d$: $z_K$, $z_H$

  d     z_K    z_H
  0     0      0
  .2   .100   .100
  .4   .199   .199
  .6   .296   .298
  .8   .390   .395
 1.0   .481   .490
 1.2   .569   .583
 1.4   .653   .674
 1.6   .733   .763
 1.8   .809   .848
 2.0   .881   .931
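The transformations compared in Table III, together with the homogeneity and pooling machinery described earlier, can be sketched as follows. This is a minimal illustration; the effect sizes $d_i$ and degrees of freedom $\nu_i$ in the demonstration are hypothetical:

```python
import numpy as np
from scipy import stats

def z_K(d):
    """Kraemer: tanh^{-1}(d / sqrt(d^2 + 4)); approx N(z_K(delta), 1/(N - 3))."""
    return np.arctanh(d / np.sqrt(d**2 + 4.0))

def z_H(d):
    """Hedges-Olkin: sqrt(2) * sinh^{-1}(d / (2 sqrt(2))); approx N(z_H(delta), 1/N)."""
    return np.sqrt(2.0) * np.arcsinh(d / (2.0 * np.sqrt(2.0)))

# Reproduce Table III
for d in np.arange(0.0, 2.01, 0.2):
    print(f"{d:3.1f}  {z_K(d):.3f}  {z_H(d):.3f}")   # e.g. d = 2.0 -> 0.881, 0.931

# Homogeneity test and pooling on the z scale (hypothetical studies)
d_i = np.array([0.3, 0.5, 0.8])      # one effect size per study
nu_i = np.array([18, 28, 38])        # nu = N - 2 within each study
z_i = z_K(d_i)
w = nu_i - 1.0
z_pooled = np.sum(w * z_i) / np.sum(w)                  # z(rho_hat)
X2 = np.sum(w * (z_i - z_pooled)**2)                    # ~ chi^2, m - 1 df, under H0
rho_hat = np.tanh(z_pooled)
delta_hat = 2.0 * rho_hat / np.sqrt(1.0 - rho_hat**2)   # pooled effect size
print(X2, stats.chi2.isf(0.05, len(d_i) - 1), delta_hat)
```

Note that the pooled $\hat\delta$ is obtained by back-transforming the weighted mean on the $z$ scale, not by averaging the $d_i$ directly, in keeping with the caveats above.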
small errors incurred by using $d$ rather than $z_K$ or $z_H$, since they tend to be in the same direction, cumulate over studies in a meta-analysis and do not cancel each other out. As a result, there may be major impact on what inferences are drawn from a meta-analysis and, ultimately, what recommendations are based thereon.

References

Fisher, R. A., & Yates, F. Statistical tables. London: Oliver and Boyd, 1957.
Glass, G. V. Primary, secondary and meta-analysis of research. Educational Researcher, 1976, 5, 3-8.
Glass, G. V, McGaw, B., & Smith, M. L. Meta-analysis in social research. Beverly Hills: Sage, 1981.
Hedges, L. V. Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 1981, 6(2), 107-128.
Hedges, L. V. Estimation and testing for differences in effect size: Comment on Hsu. Psychological Bulletin, 1982, 91, 391-393. (a)
Hedges, L. V. Estimation of effect size from a series of independent experiments. Psychological Bulletin, 1982, 92, 490-499. (b)
Hedges, L. V. Fitting categorical models to effect sizes from a series of experiments. Journal of Educational Statistics, 1982, 7, 119-137. (c)
Hedges, L. V., & Olkin, I. Clustering estimates of effect magnitude from independent studies (Technical Report No. 173). Stanford, Calif.: Stanford University, April 1981.
Hsu, L. M. Tests of differences in p-levels as tests for differences in effect size. Psychological Bulletin, 1980, 88, 705-708.
Johnson, N. L., & Welch, B. L. Applications of the noncentral t-distribution. Biometrika, 1940, 31, 362-389.
Kraemer, H. C. On estimation and hypothesis testing problems for correlation coefficients. Psychometrika, 1975, 40(4), 473-485.
Kraemer, H. C. Tests of homogeneity of independent correlation coefficients. Psychometrika, 1979, 44(3), 329-335.
Kraemer, H. C., & Andrews, G. A non-parametric technique for meta-analysis effect size calculation.
Psychological Bulletin, 1982, 91(2), 404-412.
Kraemer, H. C., & Paik, M. A central t approximation to the noncentral t-distribution. Technometrics, 1979, 21(3), 357-360.
Resnikoff, G. J., & Lieberman, G. J. Tables of the noncentral t-distribution. Stanford, Calif.: Stanford University Press, 1957.
Scheffe, H. The analysis of variance. New York: John Wiley & Sons, 1959.
Winer, B. J. Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill, 1971.

Author

KRAEMER, HELENA CHMURA. Associate Professor of Biostatistics in Psychiatry, Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, California 94306. Specializations: Statistical applications in bio-behavioral areas, particularly of correlation techniques.