Assessing heterogeneity in meta-analysis: Q statistic or I2 index?


Center for Health, Intervention, and Prevention (CHIP)
CHIP Documents
University of Connecticut, Year 2006

Assessing heterogeneity in meta-analysis: Q statistic or I² index?

Tania Huedo-Medina, University of Connecticut, tania.huedo-medina@uconn.edu
Julio Sanchez-Meca, University of Murcia, Spain
Fulgencio Marin-Martinez, University of Murcia, Spain
Juan Botella, Autonoma University of Madrid, Spain

This paper is posted at DigitalCommons@UConn. docs/19

ASSESSING HETEROGENEITY IN META-ANALYSIS: Q STATISTIC OR I² INDEX?

Tania B. Huedo-Medina,¹ Julio Sánchez-Meca,¹ Fulgencio Marín-Martínez,¹ and Juan Botella²

Running head: Assessing heterogeneity in meta-analysis

¹ University of Murcia, Spain
² Autónoma University of Madrid, Spain

Address for correspondence: Tania B. Huedo-Medina, Dept. of Basic Psychology & Methodology, Faculty of Psychology, Espinardo Campus, Murcia, Spain. E-mail: hmtania@um.es

* This work has been supported by Plan Nacional de Investigación Científica, Desarrollo e Innovación Tecnológica from the Ministerio de Educación y Ciencia and by funds from the Fondo Europeo de Desarrollo Regional, FEDER (Project Number: SEJ /PSIC).

Abstract

In meta-analysis, the usual way of assessing whether a set of single studies are homogeneous is by means of the Q test. However, the Q test only informs us about the presence versus the absence of heterogeneity, but it does not report on the extent of such heterogeneity. Recently, the I² index has been proposed to quantify the degree of heterogeneity in a meta-analysis. In this paper, the performances of the Q test and the confidence interval around the I² index are compared by means of a Monte Carlo simulation. The results show the utility of the I² index as a complement to the Q test, although it has the same problems of power with a small number of studies.

KEY WORDS: Meta-analysis, effect size, heterogeneity, I² index, Monte Carlo method.

In the last 25 years meta-analysis has been widely accepted in the social and health sciences as a very useful research methodology to quantitatively integrate the results of a collection of single studies on a given topic. In a meta-analysis the result of every study is quantified by means of an effect-size index (e.g., standardized mean difference, correlation coefficient, odds ratio, etc.) that can be applied to all studies, enabling us to give the study results in the same metric (Cooper, 1998; Cooper & Hedges, 1994; Egger, Smith, & Altman, 2001; Glass, McGaw, & Smith, 1981; Hedges & Olkin, 1985; Hunter & Schmidt, 2004; Rosenthal, 1991; Sutton, Abrams, Jones, Sheldon, & Song, 2000; Whitehead, 2002). Typically, meta-analysis has three main goals: (a) to test whether the studies' results are homogeneous, (b) to obtain a global index of the effect magnitude of the studied relation, together with a confidence interval and its statistical significance, and (c) if there is heterogeneity among studies, to identify possible variables or characteristics moderating the results obtained. Here, we focus on how to assess the heterogeneity among the results from a collection of studies.

Basically, there can be two sources of variability that explain the heterogeneity in a set of studies in a meta-analysis. One of them is the variability due to sampling error, also named within-study variability. The sampling error variability is always present in a meta-analysis, because every single study uses different samples. The other source of heterogeneity is the between-studies variability, which can appear in a meta-analysis when there is true heterogeneity among the population effect sizes estimated by the individual studies. The between-studies variability is due to the influence of an indeterminate number of characteristics that vary among the studies, such as those related to the characteristics of the samples, variations in the treatment, in the design quality, and so on (Brockwell & Gordon, 2001; Erez, Bloom, & Wells, 1996; Field, 2003; Hunter & Schmidt, 2000; National Research Council, 1992).

Assessing the heterogeneity in a meta-analysis is a crucial issue because the presence versus the absence of true heterogeneity (between-studies variability) can affect the statistical model that the meta-analyst decides to apply to the meta-analytic database. So, when the studies' results only differ by the sampling error (homogeneous case), a fixed-effects model can be applied to obtain an average effect size. By contrast, if the study results differ by more than the sampling error (heterogeneous case), then the meta-analyst can assume a random-effects model, in order to take into account both within- and between-studies variability, or can decide to search for moderator variables from a fixed-effects model (Field, 2001, 2003; Hedges, 1994; Hedges & Olkin, 1985; Hedges & Vevea, 1998; Overton, 1998; Raudenbush, 1994).

The usual way of assessing whether there is true heterogeneity in a meta-analysis has been to use the Q test, a statistical test defined by Cochran (1954). The Q test is computed by summing the squared deviations of each study's effect estimate from the overall effect estimate, weighting the contribution of each study by its inverse variance. Under the hypothesis of homogeneity among the effect sizes, the Q statistic follows a chi-square distribution with k − 1 degrees of freedom, k being the number of studies. Not rejecting the homogeneity hypothesis usually leads the meta-analyst to adopt a fixed-effects model because it is assumed that the estimated effect sizes only differ by sampling error. In contrast, rejecting the homogeneity assumption can lead to applying a random-effects model that includes both within- and between-studies variability. A shortcoming of the Q statistic is that it has poor power to detect true heterogeneity among studies when the meta-analysis includes a small number of studies and excessive power to detect negligible variability with a high number of studies (Alexander, Scozzaro, & Borodkin, 1989; Cornwell, 1993; Cornwell & Ladd, 1993; Hardy & Thompson, 1998; Harwell, 1997; Osburn, Callender, Greener, & Ashworth, 1983; Paul & Donner, 1992; Sackett, Harris, & Orr, 1986; Sagie & Koslowsky, 1993; Sánchez-Meca & Marín-Martínez, 1997; Spector & Levine, 1987).
Thus, a nonsignificant result for the Q test with a small number of studies can lead a reviewer to erroneously assume a fixed-effects model when there is true heterogeneity among the studies; and vice versa. On the other hand, the Q statistic does not inform us of the extent of true heterogeneity, only of its statistical significance.¹

Another strategy for quantifying the true heterogeneity in a meta-analysis consists of estimating the between-studies variance, τ². Assuming a random-effects model, the between-studies variance reflects how much the true population effect sizes estimated in the single studies of a meta-analysis differ. As τ² depends on the particular effect metric used in a meta-analysis, it is not possible to compare the τ² values estimated from meta-analyses that have used different effect-size indices (e.g., standardized mean differences, correlation coefficients, odds ratios, etc.).

In order to overcome the shortcomings of the Q test and of τ², Higgins and Thompson (2002; see also Higgins, Thompson, Deeks, & Altman, 2003) have proposed three indices for assessing heterogeneity in a meta-analysis: the H², R², and I² indices. As they are inter-related, here we focus on the I² index, because of its easy interpretation. The I² index measures the extent of true heterogeneity by dividing the difference between the result of the Q test and its degrees of freedom (k − 1) by the Q value itself, and multiplying by 100. So, the I² index is similar to an intraclass correlation in cluster sampling (Higgins & Thompson, 2002). The I² index can be interpreted as the percentage of the total variability in a set of effect sizes due to true heterogeneity, that is, to between-studies variability. For example, a meta-analysis with I² = 0 means that all variability in effect size estimates is due to sampling error within studies. On the other hand, a meta-analysis with I² = 50 means that half of the total variability among effect sizes is caused not by sampling error, but by true heterogeneity between studies. Higgins and Thompson (2002) proposed a tentative classification of I² values with the purpose of helping to interpret its magnitude. Thus, percentages of around 25% (I² = 25), 50% (I² = 50), and 75% (I² = 75) would mean low, medium, and high heterogeneity, respectively. The I² index and the between-studies variance, τ², are directly related: the higher the τ², the higher the I² index. However, following Higgins and Thompson (2002), an advantage of the I² index with respect to τ² is that I² indices obtained from meta-analyses with different numbers of studies and different effect metrics are directly comparable.

¹ It is important to note that the low statistical power of the Q test for a small number of studies has promoted the undesirable practice among some meta-analysts of ignoring the results of Q when it is not statistically significant, and searching for moderator variables. On the other hand, the meta-analyst can a priori adopt a statistical model (fixed- or random-effects model) on conceptual grounds. For example, if the meta-analyst wishes to generalize the meta-analytic results to a population of studies with characteristics similar to those represented in the meta-analysis, a fixed-effects model can be selected. If, on the contrary, the meta-analytic results have to be generalized to a wider population of studies, a random-effects model should be the best option (Field, 2001; Hedges & Vevea, 1998).

Together with this descriptive interpretation of the I² index, Higgins and Thompson (2002) have derived a confidence interval for it that might be used in the same way as the Q test is used to assess heterogeneity in meta-analysis. Thus, if the confidence interval around I² contains the 0% value, then the meta-analyst can retain the homogeneity hypothesis. If, on the contrary, the confidence interval does not include the 0% value, then there is evidence for the existence of true heterogeneity. Using the I² index and its confidence interval is similar to applying the Q test. Because the I² index assesses not only heterogeneity in meta-analysis, but also the extent of that heterogeneity, it should be a more advisable procedure than the Q test in assessing whether or not there is true heterogeneity among the studies in a meta-analysis. However, the performance of the confidence interval around I² has not yet been studied in terms of the control of Type I error rate and statistical power.

The purpose of this paper is to compare, by a Monte Carlo simulation, the performance of the Q test and the confidence interval around the I² index, in terms of their control of Type I error rate and statistical power. Different effect-size indices were used and both the extent of true heterogeneity and the number of studies were varied. Thus, it is possible to test whether the confidence interval for I² overcomes the shortcomings of the Q test.

Effect-size indices

For each individual study, we assume two underlying populations representing the experimental versus control groups on a continuous outcome. Let $\mu_E$ and $\mu_C$ be the experimental and control population means, and $\sigma_E$ and $\sigma_C$ the population standard deviations, respectively. By including a control condition in the typical design we restrict the applicability of our results to research fields in which such designs make sense (e.g., treatment outcome evaluation in behavioral sciences, education, medicine, etc.). Under the assumptions of normal distributions and homoscedasticity, the usual parametric effect-size index is the standardized mean difference, δ, defined as the difference between the experimental and control population means, $\mu_E$ and $\mu_C$, divided by the pooled population standard deviation, σ (Hedges & Olkin, 1985, p. 76, eq. 2),

$$\delta = \frac{\mu_E - \mu_C}{\sigma}. \qquad (1)$$

The best estimator of the parametric effect size, δ, is the sample standardized mean difference, d, proposed by Hedges and Olkin (1985, p. 81, eq. 10) and computed by

$$d = c(m)\,\frac{\bar{y}_E - \bar{y}_C}{S}, \qquad (2)$$

with $\bar{y}_E$ and $\bar{y}_C$ being the sample means of the experimental and control groups, respectively, and S being a pooled estimate of the within-group standard deviation, given by (Hedges & Olkin, 1985, p. 79),

$$S = \sqrt{\frac{(n_E - 1)\,S_E^2 + (n_C - 1)\,S_C^2}{n_E + n_C - 2}}, \qquad (3)$$

with $S_E^2$, $S_C^2$, $n_E$, and $n_C$ being the sample variances and the sample sizes of the experimental and control groups, respectively. The term c(m) is a correction factor for the positive bias suffered by the standardized mean difference with small sample sizes and estimated by (Hedges & Olkin, 1985, p. 81, eq. 7),

$$c(m) = 1 - \frac{3}{4m - 1}, \qquad (4)$$

with $m = n_E + n_C - 2$. The sampling variance of the d index is estimated by Hedges and Olkin (1985, p. 86, eq. 15) as

$$S_d^2 = \frac{n_E + n_C}{n_E\,n_C} + \frac{d^2}{2\,(n_E + n_C)}. \qquad (5)$$

Another effect-size index from the d family is that proposed by Glass et al. (1981; see also Glass, 1976), consisting of dividing the difference between the experimental and control group means by the standard deviation of the control group. Here we will represent this index by g (Glass et al., 1981, p. 105):²

$$g = c(m)\,\frac{\bar{y}_E - \bar{y}_C}{S_C}, \qquad (6)$$

where $S_C$ is the estimated standard deviation of the control group and c(m) is the correction factor for small sample sizes given by equation (4), but with $m = n_C - 1$ (Glass et al., 1981, p. 113). The g index is recommended when the homoscedasticity assumption is violated. Glass et al. (1981) proposed dividing the mean difference by the standard deviation of the control group because the experimental manipulation can change the variability in the group; thus, under this circumstance they argue that it is better to estimate the population standard deviation by the control group standard deviation. Therefore, in the strict sense, the g index is estimating a different population effect size from that defined in equation (1), δ, consisting in dividing the mean difference by the population standard deviation of the control group: $\delta_C = (\mu_E - \mu_C)/\sigma_C$ (Glass et al., 1981, p. 11). The sampling variance of the g index is given by Rosenthal (1994, p. 238) as

$$S_g^2 = \frac{n_E + n_C}{n_E\,n_C} + \frac{g^2}{2\,(n_C - 1)}. \qquad (7)$$

² Although Glass et al. (1981) represented this effect-size index with the Greek symbol Δ, here we prefer to keep Greek symbols to represent parameters, not estimates. Thus, we have selected the Latin letter g to represent this effect-size index.

The statistical model

Once an effect-size estimate is obtained from each individual study, meta-analysis integrates them by calculating an average effect size, assessing the statistical heterogeneity around the average estimate, and searching for moderator variables when there is more heterogeneity than can be explained by chance. In general, the most realistic statistical model to integrate the effect estimates in a meta-analysis is the random-effects model, because it incorporates the two possible sources of heterogeneity among the studies in a meta-analysis: first, statistical variability caused by sampling error and, second, substantive variability.

Let $T_i$ be the ith effect estimate in a collection of k studies (i = 1, 2, ..., k). Here $T_i$ corresponds to the d and g effect indices defined above by equations (2) and (6), respectively. In a random-effects model it is assumed that every $T_i$ effect is estimating a parametric effect size, $\theta_i$, with conditional variance $\sigma_i^2$, estimated by $\hat{\sigma}_i^2$. The estimated conditional variances, $\hat{\sigma}_i^2$, for the d and g indices are defined by equations (5) and (7), respectively. The model can be formulated as $T_i = \theta_i + e_i$, where the errors, $e_i$, are normally and independently distributed with mean zero and variance $\sigma_i^2$ [$e_i \sim N(0, \sigma_i^2)$]. The conditional variance represents the within-study variability, that is, the variability produced by random sampling.

In turn, the parametric effect sizes $\theta_i$ pertain to an effect-parameter distribution with mean $\mu_\theta$ and unconditional variance τ². So, every $\theta_i$ parameter can be defined as $\theta_i = \mu_\theta + u_i$, where it is usually assumed that the errors $u_i$ are normally and independently distributed with mean zero and variance τ² [$u_i \sim N(0, \tau^2)$]. The unconditional variance, τ², represents the extent of true heterogeneity among the study effects produced by the influence of an innumerable number of substantive (e.g., type of treatment, characteristics of the subjects, setting, etc.) and methodological (e.g., type of design, attrition, sample size, random versus non-random assignment, etc.) characteristics of the studies (Lipsey, 1994). Therefore, the random-effects model can be formulated as (Hedges & Vevea, 1998; Overton, 1998; Raudenbush, 1994):

$$T_i = \mu_\theta + u_i + e_i, \qquad (8)$$

where the errors $u_i$ and $e_i$ represent the two variability sources affecting the effect estimates, $T_i$, and quantified by the between-studies, τ², and within-study, $\sigma_i^2$, variances. Therefore, the effect estimates $T_i$ will be normally and independently distributed with mean $\mu_\theta$ and variance $\tau^2 + \sigma_i^2$ [$T_i \sim N(\mu_\theta, \tau^2 + \sigma_i^2)$].
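The effect-size estimators of equations (2) to (7), which supply the $T_i$ and $\hat{\sigma}_i^2$ entering the model, can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' GAUSS code; the function names (`correction_factor`, `hedges_d`, `glass_g`) and the argument order are assumptions introduced here.

```python
import math


def correction_factor(m):
    """Small-sample bias correction c(m) = 1 - 3 / (4m - 1), equation (4)."""
    return 1.0 - 3.0 / (4.0 * m - 1.0)


def hedges_d(mean_e, mean_c, var_e, var_c, n_e, n_c):
    """Return (d, estimated sampling variance of d), equations (2), (3), (5).

    var_e and var_c are the sample variances of the two groups.
    """
    # Equation (3): pooled within-group standard deviation.
    pooled_sd = math.sqrt(
        ((n_e - 1) * var_e + (n_c - 1) * var_c) / (n_e + n_c - 2)
    )
    # Equation (2), with m = n_E + n_C - 2 in the correction factor.
    d = correction_factor(n_e + n_c - 2) * (mean_e - mean_c) / pooled_sd
    # Equation (5): sampling variance of d.
    var_d = (n_e + n_c) / (n_e * n_c) + d ** 2 / (2 * (n_e + n_c))
    return d, var_d


def glass_g(mean_e, mean_c, var_c, n_e, n_c):
    """Return (g, estimated sampling variance of g), equations (6) and (7).

    Only the control-group variance is used to standardize the difference.
    """
    # Equation (6), with m = n_C - 1 in the correction factor.
    g = correction_factor(n_c - 1) * (mean_e - mean_c) / math.sqrt(var_c)
    # Equation (7): sampling variance of g.
    var_g = (n_e + n_c) / (n_e * n_c) + g ** 2 / (2 * (n_c - 1))
    return g, var_g
```

Both functions take summary statistics rather than raw scores, so they can be applied directly to the means, variances, and sample sizes reported by each study.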

When there is no true heterogeneity among the effect estimates, then the between-studies variance is zero (τ² = 0), and there will only be variability due to sampling error, which is represented in the model by the conditional within-study variance, $\sigma_i^2$. In this case, all the studies estimate one parametric effect size, $\theta_i = \theta$, and the statistical model simplifies to $T_i = \theta + e_i$, thus becoming a fixed-effects model. So, the fixed-effects model can be considered as a particular case of the random-effects model when there is no between-studies variability and, as a consequence, the effect estimates, $T_i$, are only affected by sampling error, $\sigma_i^2$, following a normal distribution with mean θ (being in this case $\theta = \mu_\theta$) and variance $\sigma_i^2$ [$T_i \sim N(\theta, \sigma_i^2)$] for large sample sizes.

Assessing the extent of heterogeneity in a meta-analysis helps to decide which of the two models is the most plausible, and this decision affects, at least, the weighting factor used to obtain an average effect size. The usual estimate of a mean effect size consists of weighting every effect estimate, $T_i$, by its inverse variance, $w_i$:

$$\bar{T} = \frac{\sum_i w_i T_i}{\sum_i w_i}. \qquad (9)$$

In a fixed-effects model, the weighting factor for the ith study is estimated by $w_i = 1/\hat{\sigma}_i^2$. In a random-effects model, the weights are estimated by $w_i = 1/(\hat{\tau}^2 + \hat{\sigma}_i^2)$. For the d and g indices the estimated within-study variances, $\hat{\sigma}_i^2$, are defined in equations (5) and (7), respectively. A commonly used estimator of the between-studies variance, τ², is an estimator based on the method of moments proposed by DerSimonian and Laird (1986):

$$\hat{\tau}^2 = \begin{cases} \dfrac{Q - (k - 1)}{c} & \text{for } Q > (k - 1) \\[1.5ex] 0 & \text{for } Q \le (k - 1) \end{cases} \qquad (10)$$

c being

$$c = \sum_i w_i - \frac{\sum_i w_i^2}{\sum_i w_i}, \qquad (11)$$

where $w_i$ is the weighting factor for the ith study assuming a fixed-effects model ($w_i = 1/\hat{\sigma}_i^2$), k is the number of studies, and Q is the statistical test for heterogeneity proposed by Cochran (1954) and defined in equation (12). To avoid negative values for $\hat{\tau}^2$ when Q ≤ (k − 1), $\hat{\tau}^2$ is equated to 0. Note that due to this truncation, $\hat{\tau}^2$ is a biased estimator for τ².

Assessing heterogeneity in meta-analysis

Quantifying the extent of heterogeneity among a collection of studies is one of the most troublesome aspects of a meta-analysis. It is important because it can affect the decision about the statistical model to be selected, fixed- or random-effects. On the other hand, if significant variability is found, potential moderator variables can be sought to explain this variability.

The between-studies variance, τ², is the parameter in the statistical model that mainly represents the true (substantive, clinical) heterogeneity among the true effects of the studies. Therefore, a good procedure for determining whether there is true heterogeneity among a collection of studies should be positively correlated with τ². At the same time, it should not be affected by the number of studies, and should be scale-free in order to be comparable among meta-analyses that have applied different effect-size indices.

The statistical test usually applied in meta-analysis for determining whether there is true heterogeneity among the studies' effects is the Q test, proposed by Cochran (1954) and defined as (Hedges & Olkin, 1985, p. 13, eq. 5):

$$Q = \sum_i w_i \left(T_i - \bar{T}\right)^2, \qquad (12)$$

where $w_i$ is the weighting factor for the ith study assuming a fixed-effects model, and $\bar{T}$ is defined in equation (9). If we assume that the conditional within-study variances, $\sigma_i^2$, are known,³ then under the null hypothesis of homogeneity ($H_0$: $\delta_1 = \delta_2 = \dots = \delta_k$; or also $H_0$: τ² = 0), the Q statistic has a chi-square distribution with k − 1 degrees of freedom. Thus, Q values higher than the critical point for a given significance level (α) enable us to reject the null hypothesis and conclude that there is statistically significant between-study variation.

One problem with the Q statistic is that its statistical power depends on the number of studies, with power being very low or very high for a small or a large number of studies, respectively. To solve the problems of the Q statistic and the non-comparability of the between-studies variance, τ², among meta-analyses with different effect-size metrics, Higgins and Thompson (2002) have recently proposed the I² index. The I² index quantifies the extent of heterogeneity from a collection of effect sizes by comparing the Q value to its expected value assuming homogeneity, that is, to its degrees of freedom (df = k − 1):

$$I^2 = \begin{cases} \dfrac{Q - (k - 1)}{Q} \times 100\% & \text{for } Q > (k - 1) \\[1.5ex] 0 & \text{for } Q \le (k - 1) \end{cases} \qquad (13)$$

When the Q statistic is smaller than its degrees of freedom, then I² is truncated to zero. The I² index can easily be interpreted as a percentage of heterogeneity, that is, the part of total variation that is due to between-studies variance, $\hat{\tau}^2$. Therefore, there is a direct relationship between $\hat{\tau}^2$ and I² that can be formalized from equations (10) and (13) as

$$I^2 = \frac{c\,\hat{\tau}^2}{Q}. \qquad (14)$$

³ In practice, the population within-study variances will never be known, so they will have to be estimated from the sample data. For example, equations (5) and (7) are used to estimate the within-study variances for the d and g indices.
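As a hedged illustration of how equations (9) to (14) fit together, the following Python sketch computes the fixed-effects mean, Q, the DerSimonian and Laird estimate of τ², and I² from a list of effect estimates and their estimated within-study variances. The function name `heterogeneity` and the dictionary keys are assumptions made for this example; the input values at the end are made up purely to check the arithmetic.

```python
def heterogeneity(effects, variances):
    """Compute Q, the DerSimonian-Laird tau^2, and I^2 (equations 9-13).

    effects   -- effect estimates T_i (for example d or g values)
    variances -- estimated within-study variances of the T_i
    """
    k = len(effects)
    df = k - 1
    # Fixed-effects weights w_i = 1 / sigma_i^2.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    # Equation (9): inverse-variance weighted mean effect.
    mean_effect = sum(wi * ti for wi, ti in zip(w, effects)) / sum_w
    # Equation (12): Cochran's Q.
    q = sum(wi * (ti - mean_effect) ** 2 for wi, ti in zip(w, effects))
    # Equation (11): scaling constant c.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Equation (10): tau^2 truncated at zero.
    tau2 = (q - df) / c if q > df else 0.0
    # Equation (13): I^2 as a percentage, truncated at zero.
    i2 = 100.0 * (q - df) / q if q > df else 0.0
    return {"mean": mean_effect, "Q": q, "df": df, "c": c,
            "tau2": tau2, "I2": i2}


# Made-up input: two studies with effects 0 and 2 and variances of 0.5 each.
result = heterogeneity([0.0, 2.0], [0.5, 0.5])
# Here Q = 4 on 1 degree of freedom, c = 2, tau^2 = 1.5, and I^2 = 75%,
# which also satisfies equation (14): c * tau^2 / Q = 0.75.
```

The statistical significance of Q is not computed here; it would be obtained by comparing Q with a chi-square distribution with k − 1 degrees of freedom.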

To show this relation empirically, Figure 1 presents the results of a simulation, assuming a random-effects model with δ = 0.5, k = 50, an average sample size N = 50 ($n_E = n_C$ for every study), and manipulating the parametric between-studies variance, τ², with values from 0.0 to 0.45, and 5 replications per condition. Figure 1 represents the obtained values of $\hat{\tau}^2$ and I² for every replication. So, for the manipulated conditions, $\hat{\tau}^2$ values around 0.025, 0.05, and 0.15 correspond to I² values of 25%, 50%, and 75%, respectively. Further, note that beyond a certain value of τ² there is relatively little increase in I². In particular, I² values higher than 85% will subsequently increase only slightly even if the between-studies variance increases substantially. Therefore, the I² index seems particularly useful in describing heterogeneity in a meta-analysis with a medium-to-low between-studies variance, and not so useful for large τ² values.

Higgins and Thompson (2002) have also developed a confidence interval for I². The interval is formulated by calculating another of their proposed measures of heterogeneity, the H² index, obtained by (Higgins & Thompson, 2002, p. 1545, eq. 6),

$$H^2 = \frac{Q}{k - 1}, \qquad (15)$$

also known as Birge's ratio (Birge, 1932). Then they define I² in terms of H² by means of (Higgins & Thompson, 2002, p. 1546, eq. 10),

$$I^2 = \frac{H^2 - 1}{H^2} \times 100\%. \qquad (16)$$

This allows us to express inferences about H² in terms of I². For practical application, Higgins and Thompson (2002, p. 1549) recommend a confidence interval for the natural logarithm of H, ln(H), assuming a standard normal distribution, that involves the Q statistic and k, given by

$$\exp\left\{ \ln(H) \pm z_{\alpha/2}\, SE[\ln(H)] \right\}, \qquad (17)$$

where $z_{\alpha/2}$ is the (1 − α/2) quantile of the standard normal distribution, and SE[ln(H)] is the standard error of ln(H) and is estimated by

$$SE[\ln(H)] = \begin{cases} \dfrac{1}{2}\,\dfrac{\ln(Q) - \ln(k - 1)}{\sqrt{2Q} - \sqrt{2k - 3}} & \text{if } Q > k \\[2ex] \sqrt{\dfrac{1}{2(k - 2)}\left(1 - \dfrac{1}{3(k - 2)^2}\right)} & \text{if } Q \le k \end{cases} \qquad (18)$$

The confidence limits obtained by equation (17) are in terms of the H index. Consequently, they can be easily translated into the I² metric by applying equation (16) to both confidence limits.

An example will help to illustrate the calculations for the Q statistic and the I² index. Figure 2 presents some of the results of a meta-analysis about the effectiveness of delinquent rehabilitation programs (Redondo, Sánchez-Meca, & Garrido, 1999). In particular, Figure 2 presents the results of eight studies that compared a control group with one of two different correctional programs: three studies that compared a control group with a cognitive-behavioral treatment (CBT) and five studies that compared a control group with a therapeutic community program (TC). The comparisons were measured by the d index as defined by equation (2). The purpose of the example is to illustrate the problems of the Q statistic and how the I² index is able to solve them.

As Figure 2 shows, the forest plots for the two groups of studies (the three studies for CBT and the five for TC) reflect high heterogeneity in both cases, but heterogeneity is more pronounced for CBT studies than for TC studies. In fact, the estimated between-studies variance, $\hat{\tau}^2$, for CBT is clearly higher than for TC (0.4 and 0.06, respectively). However, the Q statistic is very similar and statistically significant in both cases [CBT: Q(2) = , p = .003; TC: Q(4) = , p = .018]. Thus, a direct comparison of the two Q values is not justified because their degrees of freedom differ, and can erroneously lead to the conclusion that the two groups of studies are similarly heterogeneous. But if we calculate the I² index for both groups, then differences in the extent of heterogeneity are clearly apparent: whereas CBT studies present an I² value of 82.8%, implying high heterogeneity, the TC studies present an I² value of medium size (66.5%).

Thus, the I² index has been able to reflect differences in the degree of heterogeneity between two groups of studies when the Q statistic offers very similar results for them. The Q statistic is only useful for testing the existence of heterogeneity, but not the extent of heterogeneity. The I² index quantifies the magnitude of such heterogeneity and, if a confidence interval is calculated for it, then it can also be used for testing the heterogeneity hypothesis. In the example, the confidence limits obtained for the I² index applying equation (17) were from 47.6% to 94.4% for CBT studies, and from 12.7% to 87.1% for TC studies. In both cases, the 0% value is not contained in the confidence interval, showing the existence of heterogeneity and coinciding with the results obtained with the Q statistic. On the other hand, the width of the I² confidence interval informs about the accuracy of the true heterogeneity estimation. Here the interval is narrower for the CBT studies than for the TC studies (widths of 46.8% and 74.4%, respectively), so the heterogeneity of the CBT studies is estimated more accurately. Therefore, the I² index with its confidence interval can substitute for the Q statistic, because it offers more information.

To further show the usefulness of the I² index to compare the extent of heterogeneity among different meta-analyses, Table 1 presents the results of four meta-analyses about treatment outcome in the social and behavioral sciences, in terms of their Q tests and I² indices. As every meta-analysis has a different number of studies (k), the Q values are not comparable. However, the I² indices make it possible to assess the extent of true heterogeneity as a percentage of total variation. So, for the first three meta-analyses their respective Q values only inform about the existence of heterogeneity, while the I² values allow us to identify the Sánchez-Meca et al. (1999) meta-analysis as showing the largest heterogeneity (I² = 90.8%; 95% CI: 88.6% and 92.9%), in comparison to the other two (I² = 67.3%, 95% CI: 57% and 75.2%; and I² = 74.2%; 95% CI: 64.8% and 82.3%).
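The confidence interval of equations (15) to (18) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `i2_confidence_interval` is introduced here, a 95% level is taken as the default through z = 1.959964, limits below 0% are truncated to 0%, and the Q value in the usage line is an illustrative input chosen only because it corresponds to an I² close to 83% with k = 3; it is not a value taken from the paper.

```python
import math


def i2_confidence_interval(q, k, z=1.959964):
    """Return (I^2, lower limit, upper limit) in percent.

    q -- Cochran's Q statistic (must be positive)
    k -- number of studies (k >= 3 is needed when Q <= k)
    z -- standard normal quantile; the default gives a 95% interval
    """
    df = k - 1
    # Equation (15): H^2 = Q / (k - 1); the interval is built on ln(H).
    ln_h = 0.5 * math.log(q / df)
    # Equation (18): standard error of ln(H), with two cases.
    if q > k:
        se = 0.5 * (math.log(q) - math.log(df)) / (
            math.sqrt(2.0 * q) - math.sqrt(2.0 * k - 3.0)
        )
    else:
        se = math.sqrt(
            (1.0 / (2.0 * (k - 2))) * (1.0 - 1.0 / (3.0 * (k - 2) ** 2))
        )

    def to_i2(log_h):
        # Equation (16), truncated at 0% when H^2 falls below 1.
        h2 = math.exp(2.0 * log_h)
        return max(0.0, 100.0 * (h2 - 1.0) / h2)

    # Equation (17) on the ln(H) scale, then translated to the I^2 metric.
    return to_i2(ln_h), to_i2(ln_h - z * se), to_i2(ln_h + z * se)


# Illustrative input only: Q = 11.63 with k = 3 studies.
i2, lower, upper = i2_confidence_interval(11.63, 3)
```

If the lower limit returned by the function is 0%, the interval contains the 0% value and the homogeneity hypothesis is retained, which mirrors the decision rule described in the text.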
On the other hand, the only meta-analyss wth a nonsgnfcant Q test concdes wth a I = 0%. Method The smulaton study was programmed n GAUSS (Aptech Systems, 199). For smulatng each ndvdual study, we have assumed a two-groups desgn (expermental versus control) and a contnuous outcome. Two dfferent effect-sze ndces, both

17 Assessng heterogenety n meta-analyss 16 pertanng to the d metrc, were defned: the standardzed mean dfference d ndex defned by Hedges and Olkn (1985) and the g ndex proposed by Glass et al. (1981). The man dfference between them s the standard devaton used, as noted above. To smulate a collecton of k sngle studes we assumed a random-effects model. Thus, from a normal dstrbuton of parametrc effect szes, θ, wth mean µ θ = 0.5 and between-studes varance τ [θ N(0.5, τ )], collectons of k studes were randomly generated. The mean effect-sze parameter was fxed at µ θ = 0.5, as t can be consdered an effect of medum magntude (Cohen, 1988). 4 Once a θ value was randomly selected, two dstrbutons (for the expermental and control groups) were generated, wth means µ E = θ and µ C = 0, varance for the control group equal to 1 (σ C = 1) and varance for the expermental group equal 1,, or 4 (σ E = 1,, or 4), dependng on the rato between σ E and σ C. The dstrbutons for scores n expermental and control groups mght be normal or non-normal, wth dfferent values of skewness and kurtoss n the non-normal cases. Then, two random samples (expermental and control) were selected from the two dstrbutons wth szes n E = n C, and the means ( y E and y C ) and standard devatons (S E and S C ) were obtaned. Thus, the standardzed mean dfferences, d (eq. ) and g (eq. 6), and ther samplng varances, S d (eq. 5) and S g (eq. 7), were calculated. The calculatons for the d and g ndces, and ther samplng varances, were repeated for each one of the k studes of each smulated meta-analyss. Then, for every set of effect estmates (d and g ndces), the calculatons to obtan the Q statstc wth ts statstcal sgnfcance and the I ndex wth ts confdence nterval were carred out, applyng equatons (11), (1), and (15), respectvely. 
Thus, the followng factors were manpulated n the smulatons: (a) The between-studes varance, τ, wth values 0, 0.04, 0.08, and When τ = 0, the statstcal model becomes a fxed-effects model, because there s no between-studes varance. The selected values of τ were smlar to those used n other smulaton studes (Bggerstaff & Tweede, 1997; Brockwell & Gordon, 001; Erez et al., 1996; Feld, 001; Hedges & Vevea, 1998; Overton, 1998). 4 Addtonal smulatons varyng the value of µθ showed smlar results to that of µ θ = 0.5 for the Q statstc and the I ndex. Thus, we mantaned fxed µ θ to smplfy the smulaton desgn.

18 Assessng heterogenety n meta-analyss 17 (b) The number of studes for each meta-analyss, k, wth values 5, 10, and 0. These values for k are common n real meta-analyses and they were selected to study the performance of Q and I when the number of studes s small, because the lterature suggests poor performance under these condtons (Hardy & Thompson, 1998; Harwell, 1997; Sánchez-Meca & Marín-Martínez, 1997). (c) The wthn-study varances for expermental and control groups were vared usng ratos for expermental and control groups, respectvely, of 1:1, :1, and 4:1 as suggested n the lterature (e.g., McWllams, 1991; Wlcox, 1987). The varance of the expermental group was ncreased n comparson to that of the control group because ncreases n varablty are more plausble when there s expermental manpulaton (e.g., a psychologcal treatment) (Glass et al., 1981). (d) Usually, the studes ntegrated n a meta-analyss have dfferent sample szes. Thus, the mean sample sze for each generated meta-analyss was vared wth values N = 30, 50, and 80. The sample-sze dstrbuton used n the smulatons was obtaned by a revew of the meta-analyses publshed n 18 nternatonal psychologcal journals. Ths revew enabled us to obtan a real sample-sze dstrbuton characterzed by a Pearson skewness ndex of (more detaled nformaton s gven n Sánchez-Meca & Marín-Martínez, 1998). In accord wth ths value, three vectors of fve Ns each were selected averagng 30, 50, or 80, wth the skewness ndex gven above to approxmate real data: [1, 16, 18, 0, 84], [3, 36, 38, 40, 104], and [6, 66, 68, 70, 134]. Each vector of Ns was then replcated ether or 4 tmes for meta-analyses of k = 10 and 0 studes, respectvely. The wthn-study sample szes for the expermental and control groups were equal (n E = n C, beng N = n E + n C, for each sngle study). For example, the sample szes vector [1, 16, 18, 0, 84] means that the expermental and control group sample szes were, respectvely, [n E = n C = 6, 8, 9, 10, 4]. 
(e) Scores for the experimental and control participants in each pseudo-study were generated assuming a variety of distributions, both normal and non-normal. To generate non-normal distributions, the normality pattern was manipulated to obtain skewed distributions by means of the Fleishman (1978) algorithm, with the following values of skewness/kurtosis: 0.5/0, 0.75/0, and 1.75/3.75. These values of skewness and kurtosis can be considered of moderate magnitude (DeCarlo, 1997; Hess, Olejnik, & Huberty, 2001).

To simplify the design of the simulation study, we did not cross all of the manipulated factors. In the condition of normal distributions for the experimental and control groups in the single studies, we crossed all the factors mentioned above, obtaining a total of 4 (τ² values) × 3 (k values) × 3 (variance ratios) × 3 (N values) = 108 conditions. For the three conditions in which the score distributions of the single studies were non-normal, the design of the simulation was simplified by reducing the number of studies in each meta-analysis to only two conditions: k = 5 and 20. Thus, the number of conditions was 3 (τ² values) × 2 (k values) × 3 (variance ratios) × 3 (N values) × 3 (non-normal distributions) = 162. Therefore, the total number of manipulated conditions was 108 (normal distributions) + 162 (non-normal distributions) = 270 conditions. For each of the 270 conditions, 10,000 replications were generated.

To obtain estimates of the Type I error rate and statistical power for the Q statistic and the confidence interval for the I² index, assuming a significance level of α = .05, the following computations over the 10,000 replications in each condition were carried out:
(a) In conditions where the between-studies variance was zero (τ² = 0), the proportion of false rejections of the null hypothesis of homogeneity in the 10,000 replications was the empirical Type I error rate for the Q statistic. Similarly, the proportion of replications in which the confidence interval for I² did not contain the value zero (τ² = 0) represented its empirical Type I error rate.
Following Cochran (1952), we assumed that good control of the Type I error rate for α = .05 implies empirical rates in the range .04 to .06.
(b) In conditions with nonzero between-studies variance (τ² > 0), the proportion of rejections of the homogeneity hypothesis was the empirical power for the Q statistic, and the proportion of replications in which the confidence interval for I² did not contain the value zero (τ² = 0) was an estimate of the power of this procedure. Following Cohen (1988), we adopted 0.80 as the minimum advisable power.
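The rate computations in (a) and (b) can be illustrated with a short Python sketch. It is a simplified illustration, not the simulation program used in the study: effect estimates are drawn directly from normal distributions with a known, common sampling variance instead of being computed from generated scores. The names `q_statistic` and `rejection_rate`, the mean effect of 0.5, and the sampling variance of 0.08 are assumptions introduced here. The critical values are the upper 5% points of the chi-square distribution with k - 1 degrees of freedom.

```python
import random

# Upper 5% chi-square critical values for df = k - 1 (k = 5, 10, 20).
CHI2_CRIT_05 = {5: 9.488, 10: 16.919, 20: 30.144}


def q_statistic(effects, variances):
    """Cochran's Q: weighted sum of squared deviations from the weighted mean."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * t for w, t in zip(weights, effects)) / sum(weights)
    return sum(w * (t - mean) ** 2 for w, t in zip(weights, effects))


def rejection_rate(k, tau2, within_var=0.08, mean_effect=0.5,
                   reps=10_000, seed=1):
    """Proportion of replications in which Q exceeds its critical value.

    With tau2 == 0 this estimates the Type I error rate; with tau2 > 0 it
    estimates power. within_var and mean_effect are illustrative assumptions.
    """
    rng = random.Random(seed)
    crit = CHI2_CRIT_05[k]
    rejections = 0
    for _ in range(reps):
        effects = []
        for _ in range(k):
            # True effect of the study, then its observed estimate.
            theta = rng.gauss(mean_effect, tau2 ** 0.5)
            effects.append(rng.gauss(theta, within_var ** 0.5))
        if q_statistic(effects, [within_var] * k) > crit:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    for k in (5, 10, 20):
        print(k, rejection_rate(k, tau2=0.0), rejection_rate(k, tau2=0.16))
```

Because the sampling variances are treated as known in this sketch, Q follows its nominal chi-square distribution under homogeneity, so the estimated Type I error rate should land close to .05. In the study itself the variances are estimated from the generated scores, which is where differences between effect-size indices can arise.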

Results

First, we present the results obtained across the manipulated conditions with respect to the control of Type I error rates achieved by the Q test and the confidence interval of I² (I² CI), both for the d and g indices. Then, the results in terms of statistical power are shown.⁵

Type I error rate. Estimated Type I error rates were obtained when the between-studies variance was zero (τ² = 0). For each condition, the Type I error rate was calculated by dividing by 10,000 the number of replications in which the null hypothesis was incorrectly rejected using the Q test, or the number of replications in which the value zero was not in the I² CI. Figure 3 presents results for Type I error rates as a function of the number of studies and the average sample size under the conditions assuming normality and homoscedasticity in the experimental and control group distributions. As Figure 3 shows, good control of the Type I error rate is achieved with both the Q test and the I² CI when the d index is used, but not with the g index. The good control of the Type I error rate for Q and I² CI with the d index is affected neither by the number of studies nor by the average sample size in the meta-analysis. However, note that the Type I error rate for I² CI with the d index is slightly lower than the .04 limit that we have assumed as representing a good adjustment to the .05 nominal significance level. On the other hand, with the g index, Q and I² CI present Type I error rates clearly higher than the nominal α = .05, and importantly above the .06 limit. This inflation increases slightly with the number of studies, but diminishes with the average sample size.

When the experimental- and control-group distributions were normal but the homoscedasticity assumption was not met, both Q and I² CI maintained good control of the Type I error rate with the d index (although the Type I error rate for I² CI was slightly under the .04 limit). This result was not affected by the number of studies or the average sample size, as Figure 4 shows.
However, with the g index, a dramatic increase of the Type I error rate for Q and I² CI was found as the ratio between the experimental and control group variances increased. As Figure 4 shows, the poor performance of Q and I² CI for the g index is affected by the number of studies and the average sample size, with trends similar to those obtained assuming normality and homoscedasticity.

When the experimental- and control-group distributions were non-normal and the homoscedasticity assumption was met, the control of the Type I error rate was good for both the Q test and I² CI computed for the d index. However, as the distributions deviated from normality, the Type I error rates of Q and the I² CI for the g index suffered a drastic increase. Finally, when the normality and homoscedasticity assumptions were not met, the Type I error rates of Q and I² CI for the d index maintained their proximity to the nominal α = .05, whereas the performance of Q and I² CI for the g index remained very poor (see Figure 5).

Statistical power. The estimated power values were obtained when the between-studies variance was higher than zero (τ² > 0). For each condition, the power value was calculated by dividing by 10,000 the number of replications in which the null hypothesis was correctly rejected using the Q test, or the number of replications in which the value zero was not in the confidence interval of I². Figure 6 shows the estimated power values when the normality and homoscedasticity assumptions were met, as a function of the number of studies and the between-studies variance. As expected, the estimated power for all of the procedures increased as the number of studies and the between-studies variance increased. The results also showed that the recommended 0.8 power value (Cohen, 1988) was reached only when there were 20 or more studies and a large between-studies variance (τ² ≥ 0.16). Similar power results were obtained as a function of the average sample size. With normal distributions and heteroscedastic variances, the power values for Q and I² CI showed similar trends as a function of the number of studies: the higher the number of studies, the higher the power (see Figure 7).
Although the trend was similar for all of the procedures, Q and I² CI achieved higher power when the g index was used than when the d index was used. The better power obtained with the g index under heterogeneous variances occurred because g uses the control group standard deviation, whereas the d index uses a pooled standard deviation obtained from the experimental and control groups. In our simulations we assumed, as Glass suggested (Glass et al., 1981), control-group standard deviations smaller than those of the experimental groups. This circumstance leads to higher heterogeneity among g indices than among d indices. As a consequence, it is easier for Q and I² CI to detect heterogeneity among g indices. Finally, similar power results were obtained when the normality and homoscedasticity assumptions were not met. As Figure 8 shows, Q and I² CI achieved higher power values with the g index than with the d index. However, the inflated Type I error rates obtained with the g index imply an inappropriate performance of Q and I² CI with this index.

⁵ Because of space limitations, not all of the tables and figures for all of the manipulated conditions are presented. Interested readers can request the complete set of tables and figures from the authors.

Discussion

Traditionally, the Q test has been the normal procedure for assessing the heterogeneity hypothesis in meta-analysis (Cooper & Hedges, 1994). Recently, a new statistic named I², and a confidence interval around it, has been proposed to estimate the extent of heterogeneity, as well as its statistical significance (Higgins & Thompson, 2002; Higgins et al., 2003). Assessing heterogeneity in meta-analysis is a crucial issue because the meta-analyst's decision to select the statistical model to be applied in a meta-analysis (fixed- versus random-effects model) can be affected by the result of a homogeneity test. Due to the importance of this issue, the purpose of this paper was to compare the performance of two procedures, the Q test and the I² CI, to assess the heterogeneity among a set of single studies in a meta-analysis. In particular, Type I error rates and statistical power of the two procedures were examined by means of Monte Carlo simulation as a function of the number of studies, the average sample size, the between-studies variance, and the normality and homoscedasticity of the experimental- and control-group distributions. In addition, two different effect-size indices pertaining to the d family were used to calculate the Q test and the I² CI: the d and g indices. A comparison between the Q test and the I² CI has not yet been carried out.

Therefore, the results of our study cast some light on the performance of both procedures in assessing heterogeneity in a meta-analysis. The results of the simulation study helped us reach several conclusions related to our goals.

With respect to the control of the Type I error rate, the performance of the Q test and the I² CI was very similar. In fact, there were more differences between the procedures based on the d and g indices than between the Q test and the I² CI. In particular, with the d index both procedures achieved good control of the Type I error rate, whereas the performance of the Q test and the I² CI calculated with the g index was very poor. Moreover, Type I error rates for both procedures with the d index were not affected by the number of studies or the average sample size. However, the performance of the Q test and the I² CI depends on the effect-size metric. Therefore, confidence intervals around I² obtained from meta-analyses with different effect-size metrics should be interpreted cautiously, because they may not be comparable.

With respect to statistical power, there were no notable differences between the Q test and the I² CI. As expected, both procedures exhibited higher power as the number of studies, the average sample size, and the between-studies variance increased. However, with a small number of studies (k < 20) and/or a small average sample size (N < 80), the power is under the minimum advisable value of 0.8. In fact, both procedures calculated with the d index reached power values as small as 0.3 in some conditions. Therefore, the I² CI suffers the same problem as the Q test in terms of statistical power. On the other hand, the power of these procedures calculated with the g index was higher than that obtained with the d index. However, the higher power obtained with the g index was achieved at the expense of an inadmissibly large Type I error rate. Therefore, the performance of the Q test and I² CI with the g index is poor.
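The contrast between the two indices can be made concrete with a small numerical sketch. The Python code below assumes the definitions stated in the Results (d standardizes the mean difference by the pooled standard deviation, g by the control-group standard deviation) and omits any small-sample bias correction. The function names and the input values are illustrative assumptions, not values taken from the simulation.

```python
import math


def pooled_sd(sd_e, sd_c, n_e, n_c):
    """Pooled standard deviation of the experimental and control groups."""
    pooled_var = ((n_e - 1) * sd_e ** 2 + (n_c - 1) * sd_c ** 2) / (n_e + n_c - 2)
    return math.sqrt(pooled_var)


def d_index(mean_e, mean_c, sd_e, sd_c, n_e, n_c):
    """Mean difference standardized by the pooled standard deviation."""
    return (mean_e - mean_c) / pooled_sd(sd_e, sd_c, n_e, n_c)


def g_index(mean_e, mean_c, sd_c):
    """Mean difference standardized by the control-group standard deviation."""
    return (mean_e - mean_c) / sd_c


if __name__ == "__main__":
    # Hypothetical study with a 4:1 variance ratio (sd_e = 2, sd_c = 1).
    print("d =", d_index(0.5, 0.0, 2.0, 1.0, 25, 25))
    print("g =", g_index(0.5, 0.0, 1.0))
```

When the experimental variance exceeds the control variance, the pooled standard deviation is larger than the control-group standard deviation, so the same mean difference yields a larger g than d. With equal variances the two divisors coincide and so do the two indices.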
In any case, the usefulness of our results for the g index should be limited to real meta-analyses where the variability in the experimental groups is systematically higher than that of the control groups; this will happen only when the implementation of a treatment produces an overdispersion of the subject scores in comparison to the control group scores. The poor Type I error performance of the Q test and the I² index with the g index under normality and homoscedasticity raises various concerns, including the accuracy of the sampling variance of this index. Our results also show a negligible effect on the Type I error rates and statistical power of the Q test and the I² CI with the d index when the usual assumptions about the experimental- and control-group distributions (normality and homoscedasticity) are not met.

In summary, our findings show that the I² CI performs in a similar way to the Q test from an inferential point of view. But the I² index has important advantages with respect to the classical Q test. First, it is easily interpretable because it is a percentage and does not depend on the degrees of freedom. Another advantage is that it provides a way of assessing the magnitude of the heterogeneity in a meta-analysis, whereas the Q test reports only on the statistical significance of the homogeneity hypothesis. In addition, the I² CI informs about the accuracy of the estimate of true heterogeneity. The I² index can also be used to assess the degree of misspecification error when a qualitative moderator variable is tested. In particular, for every category of the moderator variable, an I² index can be calculated and the values directly compared in order to determine which categories show a good fit to the statistical model and which ones do not. Furthermore, the I² index can be useful for comparing the fit of alternative models with different moderator variables regardless of their degrees of freedom. Future research in this area can help to ascertain the usefulness of the I² index when the statistical model in a meta-analysis includes moderator variables.

Some warnings for the use of the I² index have to be taken into account. The confidence interval around I² used to assess the homogeneity hypothesis in meta-analysis suffers the same problems of low power that the Q test does when the number of studies is small. The I² CI does not solve the shortcomings of the Q test. Therefore, using either the I² CI or the Q test to decide upon the statistical model (fixed- versus random-effects model) in a meta-analysis can be misleading.
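As an illustration of how the index and its interval are obtained from Q, the following Python sketch assumes the test-based interval proposed by Higgins and Thompson (2002), which is built on the logarithm of H = sqrt(Q / (k - 1)) and then transformed to the I² scale. The function names are hypothetical, and the sketch is not the authors' program.

```python
import math


def i2_with_ci(q, k, z=1.96):
    """Return (I2, lower, upper) in percent for Cochran's Q from k studies."""
    if k < 3 or q <= 0:
        raise ValueError("requires k >= 3 and q > 0")
    df = k - 1
    h = math.sqrt(q / df)
    # Standard error of ln(H): one expression when Q > k, another otherwise.
    if q > k:
        se_ln_h = 0.5 * (math.log(q) - math.log(df)) / (
            math.sqrt(2 * q) - math.sqrt(2 * k - 3))
    else:
        se_ln_h = math.sqrt((1 / (2 * (k - 2))) * (1 - 1 / (3 * (k - 2) ** 2)))

    def to_i2(h_value):
        # I2 = (H^2 - 1) / H^2, truncated at zero and expressed as a percentage.
        return max(0.0, (h_value ** 2 - 1) / h_value ** 2) * 100

    lower = to_i2(math.exp(math.log(h) - z * se_ln_h))
    upper = to_i2(math.exp(math.log(h) + z * se_ln_h))
    return to_i2(h), lower, upper


def ci_rejects_homogeneity(q, k):
    """Decision rule described in the text: the I2 interval excludes zero."""
    _, lower, _ = i2_with_ci(q, k)
    return lower > 0


if __name__ == "__main__":
    print(i2_with_ci(40.0, 20), ci_rejects_homogeneity(40.0, 20))
    print(i2_with_ci(3.0, 5), ci_rejects_homogeneity(3.0, 5))
```

The point estimate reduces to (Q - (k - 1)) / Q expressed as a percentage, which is why it can be read on the same scale whatever the number of studies. The interval, however, becomes wide when k is small, in line with the low power discussed above.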
With a small number of studies (k < 20), both the I² CI and the Q test should be interpreted very cautiously.

As the I² index and its confidence interval allow us to assess simultaneously both the statistical significance and the extent of heterogeneity, the meta-analyst can obtain a more complete picture of heterogeneity than that offered by the Q test. Therefore, we propose using I² and its confidence interval to assess heterogeneity in meta-analysis, while taking into account its low statistical power when the number of studies is small. On the other hand, our results comparing the d and g indices have shown very different performances for the I² CI depending on the effect-size metric. Under our manipulated conditions, the g index systematically showed an inappropriate control of the Type I error rate and, therefore, using the Q test or the I² CI with this index is inadvisable. However, the poor performance that we have found for the Q test and the I² CI with the g index is only applicable when the studies systematically present a higher variability in the experimental group than in the control group. More research should be carried out to study the comparability of the I² index with other effect-size metrics, such as correlation coefficients, odds ratios, and so on.

Finally, it should be noted that the results of our study are limited to the simulated conditions. Consequently, additional research efforts manipulating other factors, or examining different levels of these factors, can help to assess the generalizability of our findings.

References

Alexander, R. A., Scozzaro, M. J., & Borodkin, L. J. (1989). Statistical and empirical examination of the chi-square test for homogeneity of correlations in meta-analysis. Psychological Bulletin, 106.
Aptech Systems, Inc. (1992). The GAUSS system (Vers. 3.0). Kent, WA: Author.
Biggerstaff, B. J., & Tweedie, R. L. (1997). Incorporating variability estimates of heterogeneity in the random effects model in meta-analysis. Statistics in Medicine, 16.
Birge, R. T. (1932). The calculation of errors by the method of least squares. Physical Review, 40.
Brockwell, S. E., & Gordon, R. I. (2001). A comparison of statistical methods for meta-analysis. Statistics in Medicine, 20.
Cochran, W. G. (1952). The χ² test of goodness of fit. Annals of Mathematical Statistics, 23.
Cochran, W. G. (1954). The combination of estimates from different experiments. Biometrics, 10.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.
Cooper, H. (1998). Integrating research: A guide for literature reviews (3rd ed.). Newbury Park, CA: Sage.
Cooper, H., & Hedges, L. V. (Eds.). (1994). The handbook of research synthesis. New York: Russell Sage Foundation.
Cornwell, J. M. (1993). Monte Carlo comparison of three tests for homogeneity of independent correlations. Educational & Psychological Measurement, 53.
Cornwell, J. M., & Ladd, R. T. (1993). Power and accuracy of the Schmidt and Hunter meta-analytic procedures. Educational & Psychological Measurement, 53.
DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2.
DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7.
Egger, M., Smith, G. D., & Altman, D. G. (Eds.). (2001). Systematic reviews in health care: Meta-analysis in context (2nd ed.). London: BMJ Publishing Group.
Erez, A., Bloom, M. C., & Wells, M. T. (1996). Using random rather than fixed effects models in meta-analysis: Implications for situational specificity and validity generalization. Personnel Psychology, 49.
Field, A. P. (2001). Meta-analysis of correlation coefficients: A Monte Carlo comparison of fixed- and random-effects methods. Psychological Methods, 6.
Field, A. P. (2003). The problems in using fixed-effects models of meta-analysis on real-world data. Understanding Statistics, 2.
Fleishman, A. I. (1978). A method for simulating nonnormal distributions. Psychometrika, 43.
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Research, 5, 3-8.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Newbury Park, CA: Sage.
Hardy, R. J., & Thompson, S. G. (1996). A likelihood approach to meta-analysis with random effects. Statistics in Medicine, 15.

Hardy, R. J., & Thompson, S. G. (1998). Detecting and describing heterogeneity in meta-analysis. Statistics in Medicine, 17.
Harwell, M. (1997). An empirical study of Hedges' homogeneity test. Psychological Methods, 2.
Hedges, L. V. (1994). Fixed effects models. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis. New York: Russell Sage Foundation.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Hedges, L. V., & Vevea, J. L. (1998). Fixed- and random-effects models in meta-analysis. Psychological Methods, 3.
Hess, B., Olejnik, S., & Huberty, C. J. (2001). The efficacy of two improvement-over-chance effect sizes for two-group univariate comparisons under variance heterogeneity and nonnormality. Educational & Psychological Measurement, 61.
Higgins, J. P. T., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21.
Higgins, J. P. T., Thompson, S. G., Deeks, J. J., & Altman, D. G. (2003). Measuring inconsistency in meta-analyses. British Medical Journal, 327.
Hunter, J. E., & Schmidt, F. L. (2000). Fixed effects vs random effects meta-analysis models: Implications for cumulative research knowledge. International Journal of Selection & Assessment, 8.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Newbury Park, CA: Sage.
Lipsey, M. W. (1994). Identifying potentially interesting variables and analysis opportunities. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis. New York: Russell Sage Foundation.
McWilliams, L. (1991, April). Variance heterogeneity in empirical studies in education and psychology. Paper presented at the annual colloquium of the American Educational Research Association, San Francisco.
Moreno, P. J., Méndez, F. X., & Sánchez-Meca, J. (2001). Effectiveness of cognitive-behavioural treatment in social phobia: A meta-analytic review. Psychology in Spain, 5, 17-25.
