An Introduction to Meta-Analysis

Douglas G. Bonett
University of California, Santa Cruz

How to cite this work: Bonett, D. G. (2016). An Introduction to Meta-analysis. Retrieved from http://people.ucsc.edu/~dgbonett/meta.html

Contents

1 Measures of Effect-size
  1.1 Population Parameters and Random Samples
  1.2 Mean Difference
  1.3 Standardized Mean Difference
  1.4 Risk Difference, Risk Ratio, and Odds Ratio
  1.5 Pearson and Spearman Correlations
  1.6 Regression Slope Coefficients
  1.7 Reliability and Agreement Coefficients
  1.8 Effect-size Conversions

2 Meta-Analysis Models
  2.1 Multi-study Statistical Models
  2.2 Constant-coefficient Model
  2.3 Varying-coefficient Model
  2.4 Random-coefficient Model
  2.5 Model Choice Recommendations

3 Combining Results from Multiple Studies
  3.1 Mean Differences
  3.2 Standardized Mean Differences
  3.3 Risk Differences
  3.4 Odds Ratios and Risk Ratios
  3.5 Pearson and Spearman Correlations
  3.6 Regression Slope Coefficients
  3.7 Reliability and Agreement Coefficients

4 Comparing Results from Multiple Studies
  4.1 Linear Contrasts
  4.2 Subgroup Comparisons
  4.3 Linear Models
  4.4 Assessing Publication Bias

5 Replication Designs
  5.1 Designs with One Replication Sample
  5.2 Designs with Multiple Replication Samples
  5.3 Evaluating Non-replication Evidence
  5.4 Linear Contrasts for Means and Proportions
  5.5 Replication-extension Designs

References
Study Guide

Chapter 1 Measures of Effect Size

1.1 Population Parameters and Random Samples

A population parameter is an unknown numeric value that describes all members of a specific study population. In the behavioral sciences, the study population is usually some large group of people such as all school teachers in a particular county, all students in a research participant pool at some university, or all residents in a specified geographic area. Greek letters will be used to represent population parameters such as a population mean ($\mu$), a population proportion ($\pi$), a population standard deviation ($\sigma$), and a population Pearson correlation ($\rho$). The Pearson correlation is also referred to as a measure of effect size. Other measures of effect size include a difference in population means ($\mu_1 - \mu_2$), a difference in population proportions ($\pi_1 - \pi_2$), and a ratio of population proportions ($\pi_1/\pi_2$). Additional measures of effect size are defined in Sections 1.2-1.5.

In applications where the study population is large or the cost of measurement is high, the researcher may not have the necessary resources to measure all people in the study population. In these applications, the researcher could take a random sample of n people (participants) from the study population. A random sample of size n is selected in such a way that every possible sample of size n will have the same chance of being selected.

A population parameter can be estimated from data obtained in a random sample of participants. We will consider data that are in the form of quantitative response variable scores (e.g., test scores, physiological measurements, opinion ratings) or dichotomous response variable scores (e.g., pass or fail, agree or disagree, correct or incorrect answer). The response variable score for participant i will be denoted as $y_i$. If the response variable is quantitative (i.e., measured on an interval or ratio scale), $y_i$ can be any positive or negative numeric value. If the response variable is dichotomous, $y_i$ can have only two values (e.g., 0 or 1).

Parameter estimates in isolation can be misleading because they will contain sampling error (the difference between the estimate and the parameter value) of unknown direction and unknown magnitude. The variance of a parameter estimate numerically describes the accuracy of the estimate. The square root of this variance is called the standard error of the estimate. A small standard error value indicates that the parameter estimate is likely to be close to the unknown population parameter value, while a large standard error value indicates that the parameter estimate could be very different from the study population parameter value.

The standard error of a parameter estimate can be estimated from a random sample. The parameter estimate and its estimated standard error can then be used to construct a confidence interval for the study population parameter. A $100(1 - \alpha)\%$ confidence interval for a study population parameter is a range of values that are believed, with $100(1 - \alpha)\%$ confidence, to include the unknown parameter value.

Combining or comparing parameter estimates from $m \geq 2$ studies is called a meta-analysis. The sample size from a single study is often too small to obtain a usefully narrow confidence interval for a population effect size. If results from m studies are combined, the confidence interval for the average effect size could be substantially narrower than the confidence interval for the effect size from any single study. Differences in effect sizes across study populations could reflect interesting moderator effects that will help clarify and extend theories. A moderator effect describes differences in effect sizes across the levels of a moderator variable. For instance, if the correlation between social dominance and the experience of micro-aggressive behaviors is greater for women than men, then gender is said to moderate the relation between social dominance and micro-aggression.

Estimates of several population parameters and effect sizes in study k (k = 1 to m) along with their estimated variances are given in the following sections. Most readers will want to just quickly skim these sections to get a general idea of the different parameters and effect sizes that will be considered in subsequent chapters.
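The role of the standard error in a confidence interval can be illustrated with a short sketch. It assumes a large-sample interval of the form estimate ± z × SE with a standard normal critical value; the function name `confidence_interval` and the numbers are hypothetical and are not taken from the text.

```python
from statistics import NormalDist

def confidence_interval(estimate, std_error, alpha=0.05):
    """Return a large-sample 100(1 - alpha)% interval: estimate +/- z * SE."""
    # Two-sided standard normal critical value (about 1.96 when alpha = 0.05).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_error, estimate + z * std_error

# Hypothetical numbers chosen only for illustration.
lower, upper = confidence_interval(estimate=0.40, std_error=0.10)
narrow = confidence_interval(estimate=0.40, std_error=0.05)
print(f"SE = 0.10: [{lower:.3f}, {upper:.3f}]")
print(f"SE = 0.05: [{narrow[0]:.3f}, {narrow[1]:.3f}]")
```

Halving the standard error halves the interval width, which is the sense in which a smaller standard error gives a more useful interval.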
The formulas given in this chapter will be used in later chapters in the context of specific types of analyses.

1.2 Mean Difference

Two-Group Designs

In a two-group design from study k, group 1 represents a random sample of $n_{1k}$ participants, and group 2 represents a random sample of $n_{2k}$ participants. Participant i in group j receives a quantitative score $y_{ijk}$. Estimates of population means $\mu_{1k}$ and $\mu_{2k}$ from study k are

$$\hat\mu_{1k} = \sum_{i=1}^{n_{1k}} y_{i1k}/n_{1k} \tag{1.1a}$$

$$\hat\mu_{2k} = \sum_{i=1}^{n_{2k}} y_{i2k}/n_{2k} \tag{1.1b}$$

and estimates of population variances $\sigma^2_{1k}$ and $\sigma^2_{2k}$ from study k are

$$\hat\sigma^2_{1k} = \sum_{i=1}^{n_{1k}} (y_{i1k} - \hat\mu_{1k})^2/(n_{1k} - 1) \tag{1.2a}$$

$$\hat\sigma^2_{2k} = \sum_{i=1}^{n_{2k}} (y_{i2k} - \hat\mu_{2k})^2/(n_{2k} - 1). \tag{1.2b}$$

A common measure of effect size in a two-group design with a quantitative response variable from study k is the mean difference $\mu_{1k} - \mu_{2k}$. If the two-group design is an experiment (i.e., participants are randomly assigned to groups with participants in each group receiving a specific treatment), then the mean difference describes the causal effect of treatment on the response variable. An estimate of $\mu_{1k} - \mu_{2k}$ and its estimated variance are

$$\hat\mu_{1k} - \hat\mu_{2k} \tag{1.3}$$

$$\widehat{\text{var}}(\hat\mu_{1k} - \hat\mu_{2k}) = \hat\sigma^2_{1k}/n_{1k} + \hat\sigma^2_{2k}/n_{2k}. \tag{1.4}$$

Paired-Samples Designs

Paired-samples designs have several special cases: the pretest-posttest design (where participants are measured before and then after receiving a treatment), the within-subject experimental design (where participants receive two different treatments in random order), and the longitudinal design (where participants are measured at two points in time to assess change over time).

In a paired-samples design for a random sample of $n_k$ participants in study k where participant i is assigned two quantitative scores, $y_{i1k}$ and $y_{i2k}$, estimates of population means $\mu_{1k}$ and $\mu_{2k}$ in study k are

$$\hat\mu_{1k} = \sum_{i=1}^{n_k} y_{i1k}/n_k \tag{1.5a}$$

$$\hat\mu_{2k} = \sum_{i=1}^{n_k} y_{i2k}/n_k, \tag{1.5b}$$

estimates of population variances $\sigma^2_{1k}$ and $\sigma^2_{2k}$ in study k are

$$\hat\sigma^2_{1k} = \sum_{i=1}^{n_k} (y_{i1k} - \hat\mu_{1k})^2/(n_k - 1) \tag{1.6a}$$

$$\hat\sigma^2_{2k} = \sum_{i=1}^{n_k} (y_{i2k} - \hat\mu_{2k})^2/(n_k - 1), \tag{1.6b}$$

and an estimate of the Pearson correlation between the two quantitative responses is

$$\hat\rho_k = \sum_{i=1}^{n_k} (y_{i1k} - \hat\mu_{1k})(y_{i2k} - \hat\mu_{2k})/[\hat\sigma_{1k}\hat\sigma_{2k}(n_k - 1)]. \tag{1.7}$$

An estimate of $\mu_{1k} - \mu_{2k}$ for the paired-samples design is the same as for a two-group design (Equation 1.3). The estimated variance of $\hat\mu_{1k} - \hat\mu_{2k}$ for a paired-samples design in study k is

$$\widehat{\text{var}}(\hat\mu_{1k} - \hat\mu_{2k}) = (\hat\sigma^2_{1k} + \hat\sigma^2_{2k} - 2\hat\rho_k\hat\sigma_{1k}\hat\sigma_{2k})/n_k. \tag{1.8}$$

1.3 Standardized Mean Differences

If the quantitative response variable scores do not have a clear psychological meaning, then it could be difficult to determine if the value of $\mu_{1k} - \mu_{2k}$ represents an important or unimportant effect. In these situations, a standardized mean difference, which is a unitless measure of effect size, might be easier to interpret. The following standardized mean differences for study population k are useful measures of effect size

$$\delta_k = (\mu_{1k} - \mu_{2k})\Big/\sqrt{(\sigma^2_{1k} + \sigma^2_{2k})/2} \tag{1.9}$$

$$\delta'_k = (\mu_{1k} - \mu_{2k})/\sigma_{1k} \tag{1.10}$$

where $\sigma_{1k}$ is the standard deviation in a control, placebo, or standard treatment condition. The denominators of Equations 1.9 and 1.10 are called standardizers.

Equations 1.9 and 1.10 have simple and useful interpretations if the response variable has an approximate normal distribution. The interpretability of $\delta_k$ and $\delta'_k$ relies on an important feature of the normal distribution: the inflection points of the normal distribution are one standard deviation from the mean. Knowing this property of the normal distribution, it is easy to visualize the distance between two normal distributions that differ by amounts equal to $\delta_k$ or $\delta'_k$. The interpretability of $\delta_k$ (but not $\delta'_k$) also requires the population variances to be similar.

Estimates of $\delta_k$ and $\delta'_k$ and their estimated variances are given below for the two-group and paired-samples designs.
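As a hedged illustration of Section 1.2, the sketch below computes the mean difference with its estimated variance for a two-group design (Equations 1.3 and 1.4) and for a paired-samples design (Equations 1.7 and 1.8). The function names and the small data set are hypothetical; the same scores are analyzed both ways only to show how the correlation shrinks the paired-samples variance.

```python
from math import sqrt
from statistics import mean, variance

def mean_diff_two_group(y1, y2):
    """Mean difference and its estimated variance (Equations 1.3 and 1.4)."""
    est = mean(y1) - mean(y2)
    var = variance(y1) / len(y1) + variance(y2) / len(y2)
    return est, var

def mean_diff_paired(y1, y2):
    """Mean difference and its estimated variance (Equations 1.3, 1.7, 1.8)."""
    n = len(y1)
    m1, m2 = mean(y1), mean(y2)
    s1, s2 = sqrt(variance(y1)), sqrt(variance(y2))
    # Pearson correlation between the paired scores (Equation 1.7).
    r = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / (s1 * s2 * (n - 1))
    var = (s1**2 + s2**2 - 2 * r * s1 * s2) / n
    return m1 - m2, var

# Hypothetical scores, used here for both designs purely for comparison.
y1 = [12, 15, 11, 14, 13]
y2 = [10, 12, 9, 13, 11]
est_ind, var_ind = mean_diff_two_group(y1, y2)
est_pair, var_pair = mean_diff_paired(y1, y2)
print(est_ind, var_ind)    # independent groups
print(est_pair, var_pair)  # paired scores
```

With these scores the paired-samples variance is much smaller than the two-group variance because the two sets of scores are highly correlated.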

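Two-Group Designs

An estimator of $\delta_k$ in the two-group design from study k is

$$\hat\delta_k = (\hat\mu_{1k} - \hat\mu_{2k})\Big/\sqrt{(\hat\sigma^2_{1k} + \hat\sigma^2_{2k})/2} \tag{1.11}$$

and its estimated variance is

$$\widehat{\text{var}}(\hat\delta_k) = \hat\delta_k^2(\hat\sigma^4_{1k}/df_1 + \hat\sigma^4_{2k}/df_2)/(8\hat\sigma^4) + (\hat\sigma^2_{1k}/df_1 + \hat\sigma^2_{2k}/df_2)/\hat\sigma^2 \tag{1.12}$$

where $\hat\sigma^2 = (\hat\sigma^2_{1k} + \hat\sigma^2_{2k})/2$ and $df_j = n_{jk} - 1$.

It is common practice to estimate $\delta_k$ using the pooled variance standardizer $\sqrt{[(n_{1k} - 1)\hat\sigma^2_{1k} + (n_{2k} - 1)\hat\sigma^2_{2k}]/(n_{1k} + n_{2k} - 2)}$. However, unless the sample sizes are equal or the population variances are identical, using a pooled variance standardizer gives an inconsistent estimator (i.e., an estimator that is biased even in large samples) of Equation 1.9. The pooled variance standardizer is not recommended.

An estimator of $\delta'_k$ in the two-group design is

$$\hat\delta'_k = (\hat\mu_{1k} - \hat\mu_{2k})/\hat\sigma_{1k} \tag{1.13}$$

and its estimated variance is

$$\widehat{\text{var}}(\hat\delta'_k) = \hat\delta'^2_k/(2df_1) + 1/df_1 + \hat\sigma^2_{2k}/(\hat\sigma^2_{1k}df_2) \tag{1.14}$$

where $df_j = n_{jk} - 1$.

Paired-Samples Designs

The estimators of $\delta_k$ and $\delta'_k$ in the paired-samples design are the same as in a two-group design (Equations 1.11 and 1.13). Their estimated variances in study k are

$$\widehat{\text{var}}(\hat\delta_k) = \hat\delta_k^2(\hat\sigma^4_{1k} + \hat\sigma^4_{2k} + 2\hat\rho_k^2\hat\sigma^2_{1k}\hat\sigma^2_{2k})/(8\,df\,\hat\sigma^4) + \hat\sigma^2_{dk}/(\hat\sigma^2 df) \tag{1.15}$$

$$\widehat{\text{var}}(\hat\delta'_k) = \hat\delta'^2_k/(2df) + \hat\sigma^2_{dk}/(\hat\sigma^2_{1k}df) \tag{1.16}$$

where $\hat\sigma^2 = (\hat\sigma^2_{1k} + \hat\sigma^2_{2k})/2$, $df = n_k - 1$, and $\hat\sigma^2_{dk} = \hat\sigma^2_{1k} + \hat\sigma^2_{2k} - 2\hat\rho_k\hat\sigma_{1k}\hat\sigma_{2k}$.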
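The two-group estimators in Equations 1.11 to 1.14 need only the means, standard deviations, and sample sizes that a study typically reports. The following is a minimal sketch under that assumption; the function names and the summary statistics are hypothetical.

```python
from math import sqrt

def smd_average_variance(m1, m2, sd1, sd2, n1, n2):
    """Standardized mean difference with the average-variance standardizer
    (Equation 1.11) and its estimated variance (Equation 1.12)."""
    v1, v2 = sd1**2, sd2**2
    df1, df2 = n1 - 1, n2 - 1
    s2 = (v1 + v2) / 2                      # average of the two variances
    d = (m1 - m2) / sqrt(s2)
    var_d = (d**2 * (v1**2 / df1 + v2**2 / df2) / (8 * s2**2)
             + (v1 / df1 + v2 / df2) / s2)
    return d, var_d

def smd_group1_standardizer(m1, m2, sd1, sd2, n1, n2):
    """Standardized mean difference using the group 1 standard deviation
    (Equation 1.13) and its estimated variance (Equation 1.14)."""
    df1, df2 = n1 - 1, n2 - 1
    d = (m1 - m2) / sd1
    var_d = d**2 / (2 * df1) + 1 / df1 + sd2**2 / (sd1**2 * df2)
    return d, var_d

# Hypothetical summary statistics from one study.
d_avg, var_avg = smd_average_variance(13, 11, 2, 2, 21, 21)
d_one, var_one = smd_group1_standardizer(13, 11, 2, 2, 21, 21)
print(d_avg, var_avg)
print(d_one, var_one)
```

With equal standard deviations the two point estimates coincide, but the variance estimates differ because the two standardizers carry different amounts of sampling error.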

1.4 Risk Ratio, Odds Ratio, and Risk Difference

Two-Group Designs

When the response variable is dichotomous in a two-group design, each participant can be classified into one of four possible outcomes, and the probabilities of being classified into these four outcomes can be summarized in a 2 × 2 contingency table, as shown below in the table to the left, where $\pi_{ijk}$ is the unknown probability of a person in group j exhibiting level i of the dichotomous (e.g., Event vs. Non-event) response variable. Sample data for a two-group design in study k with a dichotomous response variable can be summarized in a 2 × 2 table of observed frequency counts as shown below in the table to the right, where $n_{jk}$ is the sample size in group j.

$$
\begin{array}{lcc}
 & \text{Group 1} & \text{Group 2} \\
\text{Event} & \pi_{11k} & \pi_{12k} \\
\text{Non-event} & \pi_{21k} & \pi_{22k}
\end{array}
\qquad
\begin{array}{lcc}
 & \text{Group 1} & \text{Group 2} \\
\text{Event} & f_{11k} & f_{12k} \\
\text{Non-event} & f_{21k} & f_{22k} \\
\text{Total} & n_{1k} & n_{2k}
\end{array}
$$

The following three effect-size measures for study k are commonly used in two-group studies with a dichotomous response.

Risk Difference:
$$\theta_k = \pi_{11k}/(\pi_{11k} + \pi_{21k}) - \pi_{12k}/(\pi_{12k} + \pi_{22k}) \tag{1.17}$$

Risk Ratio:
$$\psi_k = [\pi_{11k}/(\pi_{11k} + \pi_{21k})]/[\pi_{12k}/(\pi_{12k} + \pi_{22k})] \tag{1.18}$$

Odds Ratio:
$$\omega_k = \pi_{11k}\pi_{22k}/(\pi_{12k}\pi_{21k}) \tag{1.19}$$

The risk ratio and risk difference are more intuitive and easier to interpret than the odds ratio. The risk difference is popular because the inverse of the absolute risk difference, referred to as the number needed to treat (NNT), has a useful interpretation in treatment vs. control designs. NNT is the number of people that need to be treated to prevent one person from having an adverse event. A criticism of the risk difference is that it tends to exhibit greater variability across studies than either the odds ratio or the risk ratio.

The odds ratio is a popular measure of effect size partly because inferential statistical methods for odds ratios have good small-sample properties. However, this advantage is no longer as important given recently developed inferential methods for the risk ratio and the risk difference that have been shown to have excellent small-sample properties.
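The risk difference, risk ratio, odds ratio, and NNT described in this section can be computed directly from the four cell probabilities. The sketch below is illustrative only: the function name and the probabilities are hypothetical, with group 1 read as a treatment group and group 2 as a control group.

```python
def effect_sizes(p11, p12, p21, p22):
    """Risk difference, risk ratio, odds ratio, and NNT from the four cell
    probabilities (first index: 1 = event, 2 = non-event; second index: group)."""
    risk1 = p11 / (p11 + p21)   # probability of the event in group 1
    risk2 = p12 / (p12 + p22)   # probability of the event in group 2
    rd = risk1 - risk2
    return {
        "risk_difference": rd,
        "risk_ratio": risk1 / risk2,
        "odds_ratio": (p11 * p22) / (p12 * p21),
        "nnt": 1 / abs(rd),     # inverse of the absolute risk difference
    }

# Hypothetical probabilities: 10% adverse events with treatment, 20% without.
result = effect_sizes(p11=0.10, p12=0.20, p21=0.90, p22=0.80)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this example the three measures describe the same pair of risks quite differently (a difference of -0.10, a ratio of 0.50, and an odds ratio of about 0.44), and treating about ten people would be expected to prevent one adverse event.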

A Price-Bonett estimator of the risk difference in study k is

$$\hat\theta_k = \hat\pi_{1k} - \hat\pi_{2k} \tag{1.20}$$

where $\hat\pi_{1k} = (f_{11k} + 1/m)/(n_{1k} + 2/m)$, $\hat\pi_{2k} = (f_{12k} + 1/m)/(n_{2k} + 2/m)$, and m is the number of studies that will be combined or compared in the meta-analysis. The estimated variance of $\hat\theta_k$ is

$$\widehat{\text{var}}(\hat\theta_k) = \hat\pi_{1k}(1 - \hat\pi_{1k})/(n_{1k} + 2/m) + \hat\pi_{2k}(1 - \hat\pi_{2k})/(n_{2k} + 2/m). \tag{1.21}$$

Log-transformed odds ratios or risk ratios are used when combining and comparing risk ratios or odds ratios from multiple studies. A Price-Bonett estimator of the log-risk ratio in study k is

$$\ln(\hat\psi_k) = \ln\left\{\left[(f_{11k} + \tfrac{1}{4})/(n_{1k} + \tfrac{7}{4})\right]\Big/\left[(f_{12k} + \tfrac{1}{4})/(n_{2k} + \tfrac{7}{4})\right]\right\} \tag{1.22}$$

and its estimated variance is

$$\widehat{\text{var}}\{\ln(\hat\psi_k)\} = \frac{1}{f_{11k} + \tfrac{1}{4} + (f_{11k} + \tfrac{1}{4})^2/(n_{1k} - f_{11k} + \tfrac{3}{2})} + \frac{1}{f_{12k} + \tfrac{1}{4} + (f_{12k} + \tfrac{1}{4})^2/(n_{2k} - f_{12k} + \tfrac{3}{2})}. \tag{1.23}$$

An estimator of the log-odds ratio in study k is

$$\ln(\hat\omega_k) = \ln\left[(f_{11k} + \tfrac{1}{2})(f_{22k} + \tfrac{1}{2})/\{(f_{12k} + \tfrac{1}{2})(f_{21k} + \tfrac{1}{2})\}\right] \tag{1.24}$$

and its estimated variance is

$$\widehat{\text{var}}\{\ln(\hat\omega_k)\} = \frac{1}{f_{11k} + \tfrac{1}{2}} + \frac{1}{f_{12k} + \tfrac{1}{2}} + \frac{1}{f_{21k} + \tfrac{1}{2}} + \frac{1}{f_{22k} + \tfrac{1}{2}}. \tag{1.25}$$

Paired-Samples Designs

In a paired-samples design where each participant produces two dichotomous responses, there are four possible response patterns which can be represented by a 2 × 2 contingency table with columns representing the two categories of the first response and rows representing the two categories of the second response. The probabilities of the four combinations of responses are shown below in the table to the left. The observed frequency counts for one random sample of participants are shown below in the table to the right. The sample size in study k is $n_k = f_{11k} + f_{12k} + f_{21k} + f_{22k}$.
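Before turning to the paired-samples tables, the two-group log-risk ratio and log-odds ratio estimators (Equations 1.22 to 1.25) can be sketched as follows. The adjustment constants (1/4, 7/4, 3/2, and 1/2) are as read from those equations; the function names and the frequency counts are hypothetical.

```python
from math import log

def log_risk_ratio(f11, n1, f12, n2):
    """Adjusted log-risk ratio (Equation 1.22) and its variance (Equation 1.23)."""
    est = log(((f11 + 0.25) / (n1 + 1.75)) / ((f12 + 0.25) / (n2 + 1.75)))
    var = (1 / (f11 + 0.25 + (f11 + 0.25) ** 2 / (n1 - f11 + 1.5))
           + 1 / (f12 + 0.25 + (f12 + 0.25) ** 2 / (n2 - f12 + 1.5)))
    return est, var

def log_odds_ratio(f11, f12, f21, f22):
    """Adjusted log-odds ratio (Equation 1.24) and its variance (Equation 1.25)."""
    est = log((f11 + 0.5) * (f22 + 0.5) / ((f12 + 0.5) * (f21 + 0.5)))
    var = 1 / (f11 + 0.5) + 1 / (f12 + 0.5) + 1 / (f21 + 0.5) + 1 / (f22 + 0.5)
    return est, var

# Hypothetical counts: 10 of 50 events in group 1 and 20 of 50 in group 2.
lrr, var_lrr = log_risk_ratio(f11=10, n1=50, f12=20, n2=50)
lor, var_lor = log_odds_ratio(f11=10, f12=20, f21=40, f22=30)
print(f"log risk ratio: {lrr:.4f} (variance {var_lrr:.4f})")
print(f"log odds ratio: {lor:.4f} (variance {var_lor:.4f})")
```

Because a small constant is added to every count, both estimators remain defined even when a cell frequency is zero.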

$$
\begin{array}{lcc}
 & \text{Response 1: Event} & \text{Response 1: Non-event} \\
\text{Response 2: Event} & \pi_{11k} & \pi_{12k} \\
\text{Response 2: Non-event} & \pi_{21k} & \pi_{22k}
\end{array}
\qquad
\begin{array}{lcc}
 & \text{Response 1: Event} & \text{Response 1: Non-event} \\
\text{Response 2: Event} & f_{11k} & f_{12k} \\
\text{Response 2: Non-event} & f_{21k} & f_{22k}
\end{array}
$$

As mentioned previously, paired-samples designs have several special cases which include the pretest-posttest design, the within-subject experimental design, and the longitudinal design. In a paired-samples design, the risk difference in study k is defined as

$$\theta_k = (\pi_{11k} + \pi_{21k}) - (\pi_{11k} + \pi_{12k}) = \pi_{21k} - \pi_{12k}. \tag{1.26}$$

A Bonett-Price estimator of the risk difference in a paired-samples design for study k is

$$\hat\theta_k = \hat\pi_{21k} - \hat\pi_{12k} \tag{1.27}$$

and its estimated variance is

$$\widehat{\text{var}}(\hat\theta_k) = \{\hat\pi_{12k} + \hat\pi_{21k} - \hat\theta_k^2\}/(n_k + 2/m) \tag{1.28}$$

where $\hat\pi_{12k} = (f_{12k} + 1/m)/(n_k + 2/m)$ and $\hat\pi_{21k} = (f_{21k} + 1/m)/(n_k + 2/m)$.

1.5 Pearson and Spearman Correlations

Consider a study population in which each person is assigned two quantitative scores. The two quantitative scores for person i will be denoted as $y_i$ and $x_i$. The population Pearson correlation, denoted as $\rho$, is a parameter that describes the linear association between two quantitative variables. The Pearson correlation can range from -1 to 1. A Pearson correlation value of -1 or 1 indicates a perfect linear association between x and y, while a Pearson correlation value of 0 indicates no linear association between x and y.

The Spearman correlation describes the strength of a monotonic relation between x and y. The estimate of the Spearman correlation is a Pearson correlation applied to the rank orders of x and the rank orders of y. The Spearman correlation is less influenced by extreme x or y scores than the Pearson correlation. If x and y have a nonlinear but monotonic relation, the Spearman correlation can be substantially larger than the Pearson correlation.
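The difference between the two correlations under a nonlinear but monotonic relation can be checked with a small sketch. It computes the Spearman estimate as a Pearson correlation applied to rank orders, as described above. The helper names are hypothetical, the ranking helper assumes no tied scores, and the exponential data are made up for illustration.

```python
from math import exp, sqrt
from statistics import mean

def pearson(x, y):
    """Sample Pearson correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank orders 1..n (this simple helper assumes there are no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank orders."""
    return pearson(ranks(x), ranks(y))

# A monotonic but strongly nonlinear relation: y = exp(x).
x = list(range(1, 11))
y = [exp(v) for v in x]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

The rank orders of x and y are identical here, so the Spearman estimate equals 1 while the Pearson estimate is clearly smaller.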

In a random sample of $n_k$ participants in study k, an estimator of $\rho_k$ is

$$\hat\rho_k = \hat\sigma_{yxk}/(\hat\sigma_{yk}\hat\sigma_{xk}) \tag{1.29}$$

where $\hat\sigma_{yxk} = \sum_{i=1}^{n_k} (y_{ik} - \hat\mu_{yk})(x_{ik} - \hat\mu_{xk})/(n_k - 1)$. Replacing $x_{ik}$ and $y_{ik}$ in study k with their rank order values in Equation 1.29 gives an estimate of a Spearman correlation. An estimate of the variance of $\hat\rho_k$ (for the purpose of combining and comparing correlations from multiple studies) is

$$\widehat{\text{var}}(\hat\rho_k) = (1 - \hat\rho_k^2)^2/(n_k - 3) \tag{1.30}$$

for a Pearson correlation and

$$\widehat{\text{var}}(\hat\rho_k) = (1 - \hat\rho_k^2)^2(1 + \hat\rho_k^2/2)/(n_k - 3) \tag{1.31}$$

for a Spearman correlation. Equation 1.30 assumes that x and y have an approximate bivariate normal distribution in the study population. Equation 1.31 makes a much less restrictive assumption that there is some monotonic (i.e., rank preserving) transformation of x and some monotonic transformation of y that will produce an approximate bivariate normal distribution in the study population. Equation 1.31 assumes that x and y are continuous variables, but this variance estimate should perform properly with discrete ordinal or interval scale measurements for which 10 or more values are likely to be observed in the sample. If x is dichotomous, then the Pearson correlation between x and y is called a point-biserial correlation.

1.6 Regression Slope Coefficients

The following linear regression model describes an assumed linear relation between x and y for a randomly selected person

$$y_i = \beta_0 + \beta_1 x_i + e_i \tag{1.32}$$

where $\beta_0$ is the population y-intercept and $\beta_1$ is the population slope coefficient. In nonexperimental designs, $\beta_1$ describes the change in the predicted y score associated with a 1-point increase in x. In an experimental design where participants are randomized into the levels of x, $\beta_1$ describes the change in the population mean of y caused by a 1-point increase in x. In applications where the metrics of x and y are well understood and have been measured using the same scales in all studies, $\beta_1$ might be a more useful or meaningful measure of association than $\rho$.

In a random sample of $n_k$ participants in study k, an estimator of $\beta_{1k}$ is

$$\hat\beta_{1k} = \hat\rho_k(\hat\sigma_{yk}/\hat\sigma_{xk}) \tag{1.33}$$

and the estimated variance of $\hat\beta_{1k}$ in study k is

$$\widehat{\text{var}}(\hat\beta_{1k}) = [\hat\sigma^2_{yk}(1 - \hat\rho_k^2)(n_k - 1)]/[\hat\sigma^2_{xk}(n_k - 1)(n_k - 2)]. \tag{1.34}$$

Note that Equations 1.33 and 1.34 can be computed using the Pearson correlation, standard deviation of y, and standard deviation of x reported in each study.

1.7 Reliability and Agreement Coefficients

Cronbach Alpha Reliability

Cronbach's alpha reliability, denoted as $\rho_\alpha$, is one of the most widely used measures of reliability in the social and behavioral sciences. Cronbach's alpha reliability describes the reliability of a sum (or average) of q measurements where the q measurements may represent q raters, q occasions, q alternative forms, or q questionnaire/test items. When the measurements represent multiple questionnaire/test items, which is the most common application, Cronbach's alpha is referred to as a measure of internal consistency reliability. Cronbach's alpha assumes that the q measurements have equal covariances, but the variances are not required to be equal.

Although Cronbach's alpha is not a measure of effect size, it is important to have some idea about the value of $\rho_\alpha$ in any study that uses a sum or average of q measurements as a response variable or predictor variable in a statistical analysis. The unreliability of the response variable reduces the power and precision of inferential statistical methods. Furthermore, unreliability of a predictor variable in a simple linear regression model attenuates a slope estimate, the unreliability of a response variable attenuates a standardized mean difference, and a Pearson correlation is attenuated by the unreliability of each variable.

An estimator of $\rho_\alpha$ is obtained from a random sample of n participants who are each assigned q scores. An estimator of $\rho_\alpha$ from study k is

$$\hat\rho_{\alpha k} = q\left[1 - \sum_{i=1}^{q} \hat\sigma^2_{ik}/\hat\sigma^2_{yk}\right]\Big/(q - 1) \tag{1.35}$$

where $\hat\sigma^2_{ik}$ is the estimated variance of the i-th measurement and $\hat\sigma^2_{yk}$ is the estimated variance of the sum of the q measurements in the k-th study. An estimate of the variance of $\hat\rho_{\alpha k}$ is

$$\widehat{\text{var}}(\hat\rho_{\alpha k}) = 2q(1 - \hat\rho_{\alpha k})^2/[(q - 1)(n_k - 2)]. \tag{1.36}$$

For the purpose of combining and comparing Cronbach alpha coefficients from m studies, some research recommends replacing $n_k - 2$ with $n_k - 2 - a$ where $a = [(q - 2)(m - 1)]^{1/4}$.

G-agreement

In interrater reliability studies where two raters independently classify each member of a random sample into one of two categories, the results can be summarized in a 2 × 2 contingency table with the columns representing the classification assignments of the first rater and the rows representing the classification assignments of the second rater. The probabilities of the four rater combinations are shown below in the table to the left. The observed frequency counts for one random sample of participants are shown below in the table to the right. The sample size in study k is $n_k = f_{11k} + f_{12k} + f_{21k} + f_{22k}$.

$$
\begin{array}{lcc}
 & \text{Rater 1: 1} & \text{Rater 1: 2} \\
\text{Rater 2: 1} & \pi_{11k} & \pi_{12k} \\
\text{Rater 2: 2} & \pi_{21k} & \pi_{22k}
\end{array}
\qquad
\begin{array}{lcc}
 & \text{Rater 1: 1} & \text{Rater 1: 2} \\
\text{Rater 2: 1} & f_{11k} & f_{12k} \\
\text{Rater 2: 2} & f_{21k} & f_{22k}
\end{array}
$$

In the social and behavioral sciences, the kappa coefficient is the most commonly used measure of interrater agreement. However, the value of kappa is highly sensitive to the marginal proportions, and large differences in kappa across studies may be due solely to uninteresting differences in the marginal proportions. An alternative measure of interrater agreement that is easy to interpret and is less sensitive to the marginal proportions is Guilford's index of agreement (G-agreement), which may be expressed as

$$\gamma_k = 2(\pi_{11k} + \pi_{22k}) - 1. \tag{1.37}$$

A Price-Bonett type estimator of $\gamma_k$ is

$$\hat\gamma_k = 2\hat\pi_{ok} - 1 \tag{1.38}$$

and its estimated variance is

$$\widehat{\text{var}}(\hat\gamma_k) = 4\hat\pi_{ok}(1 - \hat\pi_{ok})/(n_k + 4/m) \tag{1.39}$$

where $\hat\pi_{ok} = (f_{11k} + f_{22k} + 2/m)/(n_k + 4/m)$ and m is the number of studies that will be combined or compared. In interrater reliability studies, it is often useful to also estimate the risk difference in Equation 1.26 to assess the difference between the two raters in terms of the likelihood of assigning participants to category 1.

1.8 Effect-size Conversions

Some dichotomous variables, such as male/female or Treatment A/Treatment B, are called naturally dichotomous variables. In comparison, a quantitative attribute that has been measured on a dichotomous scale is called an artificially dichotomous variable. Artificially dichotomous variables are often used when it is difficult to quantitatively assess a quantitative attribute.

It is possible to approximate an effect size for one or two quantitative variables based on an effect size that has been computed from one or two artificially dichotomous variables. For example, if an odds ratio has been computed from one artificially dichotomous variable and one naturally dichotomous variable, it is possible to approximate a standardized mean difference and a point-biserial correlation (a correlation between a dichotomous variable and a quantitative variable) for the naturally dichotomous variable and the quantitative attribute. If an odds ratio has been computed from two artificially dichotomous variables, it is possible to approximate a Pearson correlation between the two quantitative attributes.

There is an exact relation between a point-biserial correlation and a standardized mean difference. If $\delta$ is estimated using a pooled within-group variance, the estimator of $\delta$ can be transformed into the traditional point-biserial correlation estimator. If $\delta$ is estimated using separate within-group variances (the preferred method), the estimator of $\delta$ can be transformed into an unequal-variance estimator of the point-biserial correlation. The other conversions given below are approximations, and an effect-size estimator based on an approximate transformation will be denoted as $\tilde\theta$ to distinguish it from the direct estimator $\hat\theta$.

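Odds Ratio to Standardized Mean Difference

$$\tilde\delta = \ln(\hat\omega)/1.7 \tag{1.40}$$

$$\widehat{\text{var}}(\tilde\delta) = 0.346\,\widehat{\text{var}}[\ln(\hat\omega)] \tag{1.41}$$

where $\widehat{\text{var}}[\ln(\hat\omega)]$ is given by Equation 1.25.

Odds Ratio to Point-biserial Correlation

$$\tilde\rho_{pb} = \ln(\hat\omega)\Big/\sqrt{\ln(\hat\omega)^2 + c} \tag{1.42}$$

$$\widehat{\text{var}}(\tilde\rho_{pb}) = [c/(\ln(\hat\omega)^2 + c)^{3/2}]^2\,\widehat{\text{var}}[\ln(\hat\omega)] \tag{1.43}$$

where $c = 2.89/pq$, $p = n_1/n$, $q = 1 - p$, and $\widehat{\text{var}}[\ln(\hat\omega)]$ is given by Equation 1.25.

Odds Ratio to Pearson Correlation

$$\tilde\rho = (\hat\omega^{3/4} - 1)/(\hat\omega^{3/4} + 1) \tag{1.44}$$

$$\widehat{\text{var}}(\tilde\rho) = [6\hat\omega^{3/4}/\{4(\hat\omega^{3/4} + 1)^2\}]^2\,\widehat{\text{var}}[\ln(\hat\omega)] \tag{1.45}$$

where $\widehat{\text{var}}[\ln(\hat\omega)]$ is given by Equation 1.25.

The following approximation to the Pearson correlation is more accurate than Equation 1.44 if the marginal proportions of the 2 × 2 contingency table have been reported.

$$\tilde\rho = \cos\left[\frac{3.142}{1 + \hat\omega^{b}}\right] \tag{1.46}$$

$$\widehat{\text{var}}(\tilde\rho) = \left[(3.142\,b\,\hat\omega^{b})\sin\left\{\frac{3.142}{1 + \hat\omega^{b}}\right\}\Big/(1 + \hat\omega^{b})^2\right]^2\widehat{\text{var}}[\ln(\hat\omega)] \tag{1.47}$$

where $b = \{1 - |p_{1+} - p_{+1}|/5 - (1/2 - p_{min})^2\}/2$, $p_{1+}$ and $p_{+1}$ are the row and column marginal proportions, and $p_{min}$ is the smallest marginal proportion.

Standardized Mean Difference to Log-odds Ratio

$$\ln(\tilde\omega) = 1.7\hat\delta \tag{1.48}$$

$$\widehat{\text{var}}[\ln(\tilde\omega)] = 2.89\,\widehat{\text{var}}(\hat\delta) \tag{1.49}$$

where $\widehat{\text{var}}(\hat\delta)$ is given by Equation 1.12.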
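The conversions between a log-odds ratio and a standardized mean difference (Equations 1.40, 1.41, 1.48, and 1.49) are simple rescalings by the constant 1.7. The sketch below shows that they are inverses of each other. The function names and the input values are hypothetical, and the variance passed in would normally come from Equation 1.25 or Equation 1.12.

```python
from math import log

SCALE = 1.7  # constant used in Equations 1.40 and 1.48

def log_odds_ratio_to_smd(log_or, var_log_or):
    """Approximate standardized mean difference from a log-odds ratio
    (Equation 1.40); dividing the variance by 1.7**2 matches the
    factor 0.346 in Equation 1.41."""
    return log_or / SCALE, var_log_or / SCALE**2

def smd_to_log_odds_ratio(d, var_d):
    """Approximate log-odds ratio from a standardized mean difference
    (Equation 1.48) with variance 2.89 * var(d) (Equation 1.49)."""
    return SCALE * d, SCALE**2 * var_d

# Hypothetical study result: odds ratio of 3 with var[ln(OR)] = 0.04.
d_tilde, var_d_tilde = log_odds_ratio_to_smd(log(3.0), 0.04)
back, var_back = smd_to_log_odds_ratio(d_tilde, var_d_tilde)
print(f"approximate d = {d_tilde:.4f}, variance = {var_d_tilde:.5f}")
print(f"round trip ln(OR) = {back:.4f}, variance = {var_back:.5f}")
```

The variance is rescaled by the square of the constant because the conversion is linear in the log-odds ratio.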

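Standardized Mean Difference to Point-biserial Correlation

$$\hat\rho_{pb} = \hat\delta\Big/\sqrt{\hat\delta^2 + 1/pq} \tag{1.50}$$

$$\widehat{\text{var}}(\hat\rho_{pb}) = [1/\{pq(\hat\delta^2 + 1/pq)^{3/2}\}]^2\,\widehat{\text{var}}(\hat\delta) \tag{1.51}$$

where $p = n_1/n$, $q = 1 - p$, and $\widehat{\text{var}}(\hat\delta)$ is given by Equation 1.12.

Point-biserial Correlation to Standardized Mean Difference

$$\hat\delta = \hat\rho_{pb}\sqrt{1/pq}\Big/\sqrt{1 - \hat\rho_{pb}^2} \tag{1.52}$$

$$\widehat{\text{var}}(\hat\delta) = [c/(1 - \hat\rho_{pb}^2)^{3/2}]^2\,\widehat{\text{var}}(\hat\rho_{pb}) \tag{1.53}$$

where $c = \sqrt{1/pq}$ and $\widehat{\text{var}}(\hat\rho_{pb})$ is given by Equation 1.51.

Point-biserial Correlation to Log-odds Ratio

$$\ln(\tilde\omega) = a\hat\rho_{pb}\Big/\sqrt{1 - \hat\rho_{pb}^2} \tag{1.54}$$

$$\widehat{\text{var}}[\ln(\tilde\omega)] = [a/(1 - \hat\rho_{pb}^2)^{3/2}]^2\,\widehat{\text{var}}(\hat\rho_{pb}) \tag{1.55}$$

where $a = \sqrt{2.89/pq}$, $p = n_1/n$, $q = 1 - p$, and $\widehat{\text{var}}(\hat\rho_{pb})$ is given by Equation 1.51.

Pearson Correlation to Log-odds Ratio

$$\ln(\tilde\omega) = (4/3)\ln\left(\frac{1 + \hat\rho}{1 - \hat\rho}\right) \tag{1.56}$$

$$\widehat{\text{var}}[\ln(\tilde\omega)] = (64/9)\left(\frac{1}{(1 - \hat\rho^2)^2}\right)\widehat{\text{var}}(\hat\rho) \tag{1.57}$$

where $\widehat{\text{var}}(\hat\rho)$ is given by Equation 1.30. Equation 1.56 is the inverse of Equation 1.44.

Comments

1. G-agreement corrects for chance agreement and may be expressed as $(\pi_o - \pi_c)/(1 - \pi_c)$ where $\pi_o = \pi_{11k} + \pi_{22k}$ is the probability of both raters agreeing and $\pi_c = .5$ is the probability of agreement if both raters are simply guessing.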

2. The variance estimates for the standardized mean differences assume the response variable has an approximate normal distribution. The variance estimates for an estimated mean difference in study k ($\hat\mu_{1k} - \hat\mu_{2k}$) in two-group and paired-samples designs do not make any distributional assumptions.

3. The variance estimates of the standardized mean differences tend to be too small when the response variable is leptokurtic (i.e., more peaked and/or longer tails than a normal distribution) and tend to be too large when the response variable is platykurtic (i.e., less peaked and/or shorter tails than a normal distribution).

4. The variance estimate of the Pearson correlation tends to be too small when x and y are leptokurtic and tends to be too large when x and y are platykurtic.

5. Some meta-analysis methods use Fisher's inverse hyperbolic tangent transformation of the Pearson correlation, defined as $\hat\rho^*_k = \text{arctanh}(\hat\rho_k) = [\ln(1 + \hat\rho_k) - \ln(1 - \hat\rho_k)]/2$, with estimated variance $\widehat{\text{var}}(\hat\rho^*_k) = 1/(n_k - 3)$.

6. Some meta-analysis methods in Chapters 3 and 4 use a log-complement transformation of Cronbach's alpha with estimated variance $\widehat{\text{var}}[\ln(1 - \hat\rho_\alpha)] = 2q/[(q - 1)(n - 2 - a)]$ where $a = [(q - 2)(m - 1)]^{1/4}$.

7. The variance estimate of Cronbach's alpha (Equation 1.36) assumes q parallel measurements. Parallel measurements have equal variances and equal covariances.

8. Some latent variable model parameters are correlation parameters such as the correlation between two factors or the standardized factor loading in an orthogonal factor analysis model. When these types of correlations are used in a meta-analysis, the correlation estimate and its standard error for each study are obtained from the latent variable computer program (e.g., Mplus, EQS, Amos, lavaan) rather than Equations 1.29 and 1.30. Note that Equation 1.30 does not apply to correlation parameter estimators in a latent variable model.

9. Some researchers convert various effect sizes into a Pearson correlation or a standardized mean difference and then incorrectly use Equation 1.30 or Equation 1.12, respectively, as the variance of the approximated effect size estimator. Also, some researchers incorrectly use Equation 1.30 as the variance of a point-biserial correlation estimator.
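The Fisher transformation in Comment 5 can be illustrated with a short sketch. It builds an interval on the transformed scale using the variance 1/(n - 3) and then back-transforms the endpoints with the hyperbolic tangent. The back-transformation step, the normal critical value, the function name, and the example numbers are assumptions of this sketch and are not specified in the comments above.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def fisher_interval(r, n, alpha=0.05):
    """Approximate 100(1 - alpha)% interval for a Pearson correlation,
    computed on the Fisher (arctanh) scale with variance 1/(n - 3)."""
    z = atanh(r)                       # [ln(1 + r) - ln(1 - r)]/2
    se = sqrt(1 / (n - 3))
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Back-transform the endpoints to the correlation scale (assumed step).
    return tanh(z - crit * se), tanh(z + crit * se)

# Hypothetical study: r = .50 with n = 103 (so the standard error is 0.1).
low, high = fisher_interval(r=0.50, n=103)
print(f"[{low:.3f}, {high:.3f}]")
```

The resulting interval is not symmetric around the sample correlation, and its endpoints always stay inside the range -1 to 1.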

Chapter 2 Meta-analysis Models

2.1 Multi-study Statistical Models

If a particular parameter has been estimated in $m \geq 2$ studies, the estimates might be comparable if very similar variables have been used in all m studies and (in studies with two-group designs) the two group conditions are very similar in all m studies. The symbol $\theta_k$ will be used to denote any of the population parameters defined in Chapter 1 for study k (k = 1 to m). In some cases $\theta_k$ will represent a transformed parameter such as a log-odds ratio or a log-risk ratio. The estimate of $\theta_k$ and its estimated variance will be denoted as $\hat\theta_k$ and $\widehat{\text{var}}(\hat\theta_k)$, respectively.

The estimate $\hat\theta_k$ is a sample realization of a random variable (called the estimator of $\theta_k$), and hence a statistical model may be used to represent the large-sample expected value of the estimator and its random deviation from expectation. Three basic types of statistical models (the constant coefficient model, the varying coefficient model, and the random coefficient model) may be used to represent the estimators from multi-study designs. These three models are described in the following sections.

2.2 Constant Coefficient Model

The constant coefficient model can be expressed as

$$\hat\theta_k = \theta + \varepsilon_k \tag{2.1}$$

where $\theta$ is the large-sample expected value of each $\hat\theta_k$. The constant coefficient model assumes that every $\hat\theta_k$ is an estimator of the same quantity ($\theta$), which is referred to as the effect-size homogeneity assumption. The constant coefficient model is a type of fixed-effect model because $\theta$ is a constant (i.e., a fixed effect) and not a random variable. The disturbances $\varepsilon_k$ (k = 1 to m) are random variables that are assumed to be independent and normally distributed but are not assumed to be homoscedastic. Stratified random sampling is typically assumed in which a random sample of size $n_k$ is obtained from each of m different study populations. In two-group designs $n_k = n_{1k} + n_{2k}$.

If the effect-size homogeneity assumption can be satisfied, the following inverse-variance weighted estimator of $\theta$ is efficient (i.e., has a variance that is smaller than that of any other estimator) and nearly unbiased if all $n_k$ are large

$$\hat\theta = \sum_{k=1}^{m} \hat w_k\hat\theta_k\Big/\sum_{k=1}^{m} \hat w_k \tag{2.2}$$

where $\hat w_k = 1/\widehat{\text{var}}(\hat\theta_k)$. The estimated variance of $\hat\theta$ is $\widehat{\text{var}}(\hat\theta) = 1/\sum_{k=1}^{m} \hat w_k$ and can be used with $\hat\theta$ to construct an approximate confidence interval for $\theta$. This variance estimate is called a large-sample estimate because it assumes that each $\hat w_k$ is computed from a large random sample.

If effect-size homogeneity cannot be assumed, Equation 2.2 is biased for any sample size and its large-sample bias (i.e., inconsistency) is approximately equal to

$$\theta - \sum_{k=1}^{m} w_k\theta_k\Big/\sum_{k=1}^{m} w_k \tag{2.3}$$

where $\theta = m^{-1}\sum_{k=1}^{m}\theta_k$ is the average of the m population effect sizes. Note that the bias vanishes if all weights ($w_k$) are equal or if all population effect sizes are equal. Recall that $\widehat{\text{var}}(\hat\theta_k)$ is a function of the sample size. The sample sizes will typically vary across studies and hence the weights will typically be unequal. With unequal weights, effect-size homogeneity is a crucial assumption in the constant coefficient model.

Some researchers compute the following chi-square test statistic with $df = m - 1$ to test the null hypothesis H0: $\theta_1 = \theta_2 = \cdots = \theta_m$

$$Q = \sum_{k=1}^{m} \hat w_k(\hat\theta_k - \hat\theta)^2 \tag{2.4}$$

and conclude that effect-size homogeneity has been satisfied if the result is nonsignificant. However, failure to reject this type of null hypothesis cannot be interpreted as evidence that H0 is true. There is no statistical test that can be used to show that $\theta_1 = \theta_2 = \cdots = \theta_m$. The best we can do is test H1: $|\theta_k - \theta_{k'}| < \varepsilon$ against H2: $|\theta_k - \theta_{k'}| \geq \varepsilon$ for all pairwise comparisons, where $\varepsilon$ is a value that represents a small and unimportant difference in parameter values. This type of test is called an equivalence test, and ideally each pairwise comparison would be tested using a Bonferroni-adjusted $\alpha$.
For acceptably sall values of ε, equivalence testing requires extreely large within-study saple sizes (n k ) in all studies to reject H2 and accept H1, and consequently it is practically

18 ipossible to statistically show that the effect-size hoogeneity assuption has been approxiately satisfied. If the constant coefficient odel and Equation 2.2 are used, it is necessary to provide a copelling logical arguent that the effect size hoogeneity assuption is realistic in the proposed application. 2.3 Varying Coefficient Model The varying coefficient odel can be expressed as θ k = θ + υ k + ε k (2.5) where k=1 υ k = 0 (this is called a side condition and is required to identify the odel), θ = 1 k=1 θ k, and υ k represents effect-size heterogeneity due to all unknown or unspecified differences in the characteristics of the study populations. We could also express the separate study population paraeter values as θ k = θ + υ k. The varying coefficient odel is a fixed-effects odel because θ and each υ k are constants (i.e., fixed effects) and not rando variables. The disturbances ε k (k = 1 to ) are rando variables that are assued to be independent and norally distributed but are not assued to be hooscedastic. Stratified rando sapling is typically assued in which a total rando saple of size n k is obtained fro different study populations. In two-group designs n k = n 1k + n 2k. In a varying coefficient odel, the researcher will want to estiate interesting linear functions of θ k such as θ = 1 k=1 θ k or h 1 θ 1 + h 2 θ 2 + + h θ where h k is a nuerical value specified by the researcher. Methods of estiating linear functions of θ k for each type of paraeter described in Chapter 1 are described in Chapters 3 and 4. Chapter 4 also describes ethods for estiating the paraeters of a linear odel θ k = β o + β 1 x 1 + β 2 x 2 + + β q x q + ε k for each of the paraeters described in Chapter 1. The ain advantage of the varying coefficient odel over the constant coefficient odel is that we do not need to assue or test for effect size hoogeneity. All of the statistical procedures for the varying coefficient odel perfor properly in the presence of effectsize heterogeneity.
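The difference between the weighted estimator of Equation 2.2 and an unweighted average of the study estimates can be illustrated numerically. The following is a minimal sketch, not a method taken from the text: the function names and the three study values are hypothetical and chosen only for illustration. It computes Equation 2.2 and its variance, the Q statistic of Equation 2.4, the unweighted average with variance $m^{-2}\sum \widehat{\mathrm{var}}(\hat{\theta}_k)$, and the bias expression of Equation 2.3.

```python
def constant_coefficient(estimates, variances):
    """Inverse-variance weighted estimate (Eq. 2.2), its variance, and Q (Eq. 2.4)."""
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    theta_hat = sum(w * t for w, t in zip(weights, estimates)) / total_w
    q = sum(w * (t - theta_hat) ** 2 for w, t in zip(weights, estimates))
    return theta_hat, 1.0 / total_w, q


def unweighted_average(estimates, variances):
    """Unweighted average of the study estimates and its estimated variance."""
    m = len(estimates)
    return sum(estimates) / m, sum(variances) / m ** 2


def weighting_bias(thetas, weights):
    """Bias expression of Eq. 2.3: mean(theta_k) - sum(w_k * theta_k) / sum(w_k)."""
    mean_theta = sum(thetas) / len(thetas)
    weighted = sum(w * t for w, t in zip(weights, thetas)) / sum(weights)
    return mean_theta - weighted


# Hypothetical values for m = 3 studies (illustration only, not from the text).
estimates = [0.2, 0.5, 0.8]
variances = [0.01, 0.04, 0.04]

theta_cc, var_cc, q = constant_coefficient(estimates, variances)
theta_vc, var_vc = unweighted_average(estimates, variances)
bias_unequal = weighting_bias(estimates, [1.0 / v for v in variances])
bias_equal = weighting_bias(estimates, [1.0, 1.0, 1.0])

print("weighted estimate:", theta_cc, "variance:", var_cc, "Q:", q)
print("unweighted estimate:", theta_vc, "variance:", var_vc)
print("Eq. 2.3 with unequal weights:", bias_unequal)
print("Eq. 2.3 with equal weights:", bias_equal)
```

With these made-up values the weighted estimate is pulled toward the most precise study, while the Equation 2.3 expression is zero once the weights are equal, which mirrors the remark that the bias vanishes with equal weights.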

2.4 Random Coefficient Model

The random coefficient model can be expressed as

$$\hat{\theta}_k = \theta + \upsilon_k + \varepsilon_k \tag{2.6}$$

where each $\upsilon_k$ is assumed to be a normally distributed random variable with mean 0 and standard deviation $\tau$. The parameter value for study $k$ is assumed to equal $\theta + \upsilon_k$. The disturbances $\varepsilon_k$ ($k = 1$ to $m$) are random variables that are assumed to be independent and normally distributed but are not assumed to be homoscedastic. The disturbances are also assumed to be uncorrelated with $\upsilon_k$.

To justify the claim that $\upsilon_k$ in Equation 2.6 is a random variable, two-stage cluster sampling can be assumed. One way to conceptualize two-stage cluster sampling is to randomly select $m$ study populations from a superpopulation of $M$ study populations and then take a random sample of size $n_k$ from each of the $m$ randomly selected study populations. In meta-analysis applications, a two-stage cluster sample is not realistic, but it is common for researchers to imagine that the $m$ studies selected for analysis could potentially represent a random sample from some hypothetical superpopulation of study populations. From this perspective, $\theta$ is the mean and $\tau$ is the standard deviation of the $\theta_1, \theta_2, \dots, \theta_M$ parameters in the superpopulation.

In a random coefficient model, the goal is to estimate both $\theta$ and $\tau$. If the assumptions of the random coefficient model can be satisfied, the following inverse-variance weighted estimator of $\theta$ is efficient and nearly unbiased if all $n_k$ are large

$$\hat{\theta} = \sum_{k=1}^{m} \hat{w}_k \hat{\theta}_k \Big/ \sum_{k=1}^{m} \hat{w}_k \tag{2.7}$$

where $\hat{w}_k = 1/[\widehat{\mathrm{var}}(\hat{\theta}_k) + \hat{\tau}^2]$ and $\hat{\tau}^2$ is the following DerSimonian-Laird estimate of $\tau^2$

$$\hat{\tau}^2 = (Q - df) \Big/ \left[\sum_{k=1}^{m} \hat{w}_k - \sum_{k=1}^{m} \hat{w}_k^2 \Big/ \sum_{k=1}^{m} \hat{w}_k\right]. \tag{2.8}$$

In Equation 2.8, $Q$ is the statistic in Equation 2.4 with $df = m - 1$, and the weights are the constant coefficient weights $1/[\widehat{\mathrm{var}}(\hat{\theta}_k)]$ of Equation 2.2. The estimated variance of $\hat{\theta}$ is $\widehat{\mathrm{var}}(\hat{\theta}) = 1/\sum_{k=1}^{m} \hat{w}_k$ (using the weights of Equation 2.7) and can be used with $\hat{\theta}$ to construct an approximate confidence interval for $\theta$.

Equation 2.7 assumes the weights ($\hat{w}_k$) are uncorrelated with the effect sizes. It can be shown that $\hat{\theta}$ is a biased estimator of $\theta$ (even with large $n_k$) when the weights are correlated with effect sizes.
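The two-step calculation in Equations 2.7 and 2.8 can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name and the study values are hypothetical, and the truncation of a negative $\hat{\tau}^2$ at zero is a common convention that is assumed here rather than taken from the text.

```python
def dersimonian_laird(estimates, variances):
    """Random coefficient estimate (Eq. 2.7) with a DerSimonian-Laird tau^2 (Eq. 2.8)."""
    m = len(estimates)

    # Step 1: constant coefficient weights and Q (Eqs. 2.2 and 2.4).
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    theta_fixed = sum(wi * t for wi, t in zip(w, estimates)) / sum_w
    q = sum(wi * (t - theta_fixed) ** 2 for wi, t in zip(w, estimates))

    # Step 2: Eq. 2.8 with df = m - 1.
    denom = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = (q - (m - 1)) / denom
    tau2 = max(0.0, tau2)  # assumption: negative estimates are set to zero

    # Step 3: Eq. 2.7 with weights 1 / (var_k + tau^2).
    w_rc = [1.0 / (v + tau2) for v in variances]
    theta_rc = sum(wi * t for wi, t in zip(w_rc, estimates)) / sum(w_rc)
    var_rc = 1.0 / sum(w_rc)
    return theta_rc, var_rc, tau2, q


# Hypothetical values for m = 3 studies (illustration only, not from the text).
estimates = [0.2, 0.5, 0.8]
variances = [0.01, 0.04, 0.04]
theta_rc, var_rc, tau2, q = dersimonian_laird(estimates, variances)
print("tau^2:", tau2, "estimate:", theta_rc, "variance:", var_rc)

# When every study reports the same estimate, Q = 0 and tau^2 is truncated to 0.
same_theta, same_var, same_tau2, same_q = dersimonian_laird([0.4, 0.4, 0.4], variances)
print("tau^2 with identical estimates:", same_tau2)
```

Adding $\hat{\tau}^2$ to every variance makes the weights more nearly equal, so the random coefficient estimate moves from the constant coefficient estimate toward the unweighted average as $\hat{\tau}^2$ grows.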
The approximate large-sample bias of $\hat{\theta}$ is equal to

$$\theta - \left[\theta + \rho_{w\theta}\,\sigma_w\,\tau \Big/ \sum_{k=1}^{m} w_k\right] \tag{2.9}$$

where $\rho_{w\theta}$ is the population correlation between the weights and the effect sizes and $w_k$ is the study population value of $\hat{w}_k$. Note that the large-sample bias of $\hat{\theta}$ vanishes if $\rho_{w\theta} = 0$, or if all weights in a superpopulation of study populations are equal ($\sigma_w = 0$), or if all effect sizes in the superpopulation are equal ($\tau = 0$). In practice, $\tau > 0$ and the weights will not be equal. Thus, the assumption of a zero correlation between weights and effect sizes is crucial for Equation 2.7.

When using a random coefficient model to analyze data from multiple studies, it is important to report a confidence interval for both $\theta$ and $\tau$ because the estimates of $\theta$ and $\tau$ contain sampling error of unknown magnitude and direction. However, the currently available confidence intervals for $\tau$ are hypersensitive to the kurtosis of the superpopulation distribution of parameter values, and the superpopulation normality assumption should be assessed before using a random coefficient model.

Some researchers report the $I^2$ measure of heterogeneity rather than a confidence interval for $\tau$. However, $I^2$ is equal to $1 - (m - 1)/Q$, where $Q$ is the chi-square effect-size homogeneity test statistic (Equation 2.4). $I^2$ is not an estimate of any population parameter and its value is heavily influenced by the within-study sample sizes. For any nonzero value of $\tau$, $I^2$ (like $Q$) will be smaller with smaller within-study sample sizes and larger with larger within-study sample sizes. The reporting of $I^2$ is not recommended.

Most random coefficient meta-analyses use Equation 2.7 to estimate $\theta$ and Equation 2.8 to estimate $\tau^2$. An older approach (referred to as the Hunter-Schmidt approach) for Pearson correlations and two-group standardized mean differences is still popular among industrial psychologists. The Hunter-Schmidt approach estimates $\theta$ with Equation 2.7 using $w_k = n_k$.
The Hunter-Schmidt estimate of $\tau^2$ is

$$\hat{\tau}^2 = \left[\frac{\sum_{k=1}^{m} n_k(\hat{\rho}_k - \hat{\theta})^2}{\sum_{k=1}^{m} n_k}\right] - \frac{m(1 - \hat{\theta}^2)^2}{\sum_{k=1}^{m} n_k} \tag{2.10a}$$

for Pearson correlations (where $\hat{\theta} = \sum_{k=1}^{m} n_k\hat{\rho}_k / \sum_{k=1}^{m} n_k$) and

$$\hat{\tau}^2 = \left[\frac{\sum_{k=1}^{m} n_k(\hat{\delta}_k - \hat{\theta})^2}{\sum_{k=1}^{m} n_k}\right] - \frac{4m\left[1 + \hat{\theta}^2/8\right]}{\sum_{k=1}^{m} n_k} \tag{2.10b}$$

for two-group standardized mean differences (where $\hat{\theta} = \sum_{k=1}^{m} n_k\hat{\delta}_k / \sum_{k=1}^{m} n_k$). The second terms in Equations 2.10a and 2.10b are estimates of the within-study sampling variance and are based on a simplifying assumption that $\tau^2 = 0$, and hence these estimates are not expected to be accurate unless $\tau^2$ is small.

2.5 Model Choice Recommendations

The effect-size homogeneity assumption of the constant coefficient model is difficult to justify in most meta-analysis applications. As noted above, a nonsignificant chi-square test for effect-size homogeneity cannot be used to justify the use of a constant coefficient model. Furthermore, an equivalence test of $|\theta_j - \theta_k| < \varepsilon$ for all $j, k$ pairs of studies requires extremely large sample sizes in all studies. The constant coefficient model is not recommended in applications where it is not possible to show that all $\theta_k$ values are similar. With effect-size heterogeneity, Equation 2.2 is a biased and inconsistent estimator of the average effect size ($\theta$), and the traditional confidence interval for $\theta$ will have a true coverage probability that can be far less than the specified confidence level. Until recently, the constant coefficient model was very popular, but most statisticians now agree that the constant coefficient model should no longer be used.

Given the difficulty in assessing the effect-size homogeneity assumption and the bias of Equation 2.2 as an estimator of $m^{-1}\sum_{k=1}^{m} \theta_k$ when this assumption is not satisfied, researchers will need to choose between the varying coefficient model and the random coefficient model. The main advantage of the random coefficient model over the varying coefficient model is the potential of the random coefficient model to provide statistical inference to a superpopulation of $M$ effect sizes or study populations. Of course, statistical inference to a superpopulation is based on a very strong and unrealistic assumption that the $m$ studies have been randomly sampled from some specified superpopulation. Strictly speaking, the varying coefficient model provides statistical inference only to the $m < M$ study populations.
However, the authors of each of the $m$ studies would argue that their results should generalize to other similar study populations, and the results from a varying coefficient model will generalize to this extended set of study populations of the $m$ studies. Thus, a superpopulation of study populations that could be described by a random coefficient model might not be any more interesting or scientifically important than the extended set of study populations that can be described by a varying coefficient model.

Three conditions must be met before a random coefficient model can be recommended over a varying coefficient model. The first condition requires a clearly defined superpopulation to which the random coefficient model results will apply, along with an argument that this superpopulation has greater scientific importance than the extended set of study populations that could be described by a varying coefficient model.

If the first condition can be met, the next condition requires evidence that the weights in Equation 2.7 are uncorrelated with effect sizes in the superpopulation. There is no standard method for assessing this assumption, and remember that a nonsignificant result for a test of H0: $\rho_{w\theta} = 0$ does not imply that the superpopulation correlation is zero. An approximate confidence interval for $\rho_{w\theta}$ could be computed, but a large number of studies would be needed to obtain an informatively narrow interval. For instance, about $m = 400$ studies would be needed to obtain a 95% confidence interval for $\rho_{w\theta}$ that has a width of about 0.2, assuming $\rho_{w\theta}$ is close to zero.

The third condition that must be met before using a random coefficient model is a demonstration of superpopulation normality. The currently available confidence intervals for $\tau$ are hypersensitive to small amounts of leptokurtosis (i.e., superpopulation distributions with greater peaks or thicker tails than a normal distribution). Tests of leptokurtosis require a large number of studies ($m > 100$) to detect a degree of leptokurtosis in the superpopulation that would seriously degrade the performance of the traditional confidence intervals for $\tau$. Furthermore, a large number of studies is usually needed to obtain an informatively narrow confidence interval for $\tau$.

Unless $m$ is large, the varying coefficient model will almost automatically be preferred to a random coefficient model. If $m$ is large, the varying coefficient model is also recommended in situations where there are concerns about superpopulation leptokurtosis, the possibility of a non-trivial value of $\rho_{w\theta}$, or if the assumed superpopulation of study populations is not substantially greater than the extended set of study populations to which the varying coefficient model results would apply.

Statistical methods for the constant coefficient model and the random coefficient model are described in many meta-analysis textbooks and will not be reviewed here. The newer varying coefficient methods have not been presented in meta-analysis texts, and these methods will be described in Chapters 3 and 4.
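The figure of about 400 studies for the second condition can be checked with a rough calculation. The worked equation below assumes a Fisher-transformation confidence interval for a correlation, with a standard error of approximately $1/\sqrt{m - 3}$ and an interval width on the correlation scale that is close to the width on the transformed scale when the correlation is near zero. The choice of interval is an assumption made for illustration, since the text does not state which interval was used.

```latex
\text{width} \approx \frac{2 z_{\alpha/2}}{\sqrt{m - 3}}
\quad\Longrightarrow\quad
m \approx \left(\frac{2 z_{\alpha/2}}{\text{width}}\right)^{2} + 3
  = \left(\frac{2(1.96)}{0.2}\right)^{2} + 3
  = 384.16 + 3 \approx 387
```

Under these assumptions the result is consistent with the statement that roughly 400 studies would be required.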

Comments

1. Some proponents of the constant coefficient model agree that Equation 2.2 is a biased estimator of $\theta = m^{-1}\sum_{k=1}^{m} \theta_k$ but argue that in large samples Equation 2.2 is a nearly unbiased estimator of $\sum_{k=1}^{m} w_k\theta_k / \sum_{k=1}^{m} w_k$. However, the $w_k$ values reflect sample characteristics (such as the sample sizes) and so the quantity $\sum_{k=1}^{m} w_k\theta_k / \sum_{k=1}^{m} w_k$ is not a meaningful population parameter.

2. The traditional confidence intervals for the superpopulation standard deviation ($\tau$) are highly sensitive to minor non-normality. Leptokurtosis causes the most serious problems. For example, a t(5) distribution is symmetric and bell-shaped with excess kurtosis of 6. Even with this small amount of excess kurtosis, a 95% confidence interval for $\tau$ in a t(5) superpopulation distribution will have a coverage probability that can be much less than .95 (even for large sample sizes and a large number of studies). Even with the most powerful test of leptokurtosis (i.e., a one-sided test using Pearson's kurtosis estimate), a very large number of studies ($m > 100$) is needed to detect a t(5) superpopulation distribution.

3. The number of studies needed to estimate $\tau$ with desired precision and confidence is approximately $m = 2[z_{\alpha/2}/\ln(r)]^2 + 2$, where $r$ is the desired upper limit to lower limit ratio. This sample size formula assumes large sample sizes in each study. For example, to obtain a 95% confidence interval for $\tau$ that has an upper to lower confidence limit ratio of 1.25, at least $m = 2[1.96/\ln(1.25)]^2 + 2 \approx 157$ studies would be required. Thus, the random coefficient model could require a large $m$ to obtain a confidence interval for $\tau$ that is not meaninglessly wide.

4. A Wald confidence interval for $\tau$ using a DerSimonian-Laird estimate is given in Borenstein et al. (2009). A confidence interval for $\tau$ using a Hunter-Schmidt estimate is not available.

5. The weighted average estimators have enormous intuitive appeal to researchers. However, unless the effect-size homogeneity assumption is satisfied in the constant coefficient model or the zero weight-estimator correlation assumption is satisfied in the random coefficient model, the popular weighted average estimators can give highly misleading results. Note that all of the effect-size estimators described in Chapter 1, except for the unstandardized mean difference and the Fisher-transformed Pearson correlation, have a variance that is a function of the effect size (although the relation is not obvious for some effect-size measures), which results in a correlation between the weights and the estimators.

6. The random coefficient model is often applied to Fisher-transformed Pearson correlations. A weighted average of these transformed correlations is used to compute $\hat{\theta}^*$ and then this average is reverse transformed to provide an estimator of the average Pearson correlation. However, the reverse transformation process introduces bias into the estimator regardless of sample size. Similar problems occur when the random coefficient model is applied to log-complement transformed Cronbach alpha reliability coefficients. Furthermore, the $\tau$ parameter does not have a useful interpretation with Fisher-transformed correlations or log-complement transformed Cronbach alpha reliabilities.
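The sample size formula in Comment 3 is easy to evaluate for other target ratios. The sketch below is a direct transcription of that formula; the function name and its default confidence level are hypothetical choices made for illustration, and the result is rounded up to the next whole study.

```python
import math
from statistics import NormalDist


def studies_needed_for_tau(ratio, conf_level=0.95):
    """Approximate number of studies: m = 2 * [z / ln(ratio)]^2 + 2 (Comment 3)."""
    # Two-sided critical z-value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf_level) / 2.0)
    return math.ceil(2.0 * (z / math.log(ratio)) ** 2 + 2.0)


m_tight = studies_needed_for_tau(1.25)  # the example worked in Comment 3
m_loose = studies_needed_for_tau(1.5)   # a less demanding precision target
print(m_tight, m_loose)
```

Relaxing the target ratio from 1.25 to 1.5 lowers the requirement considerably, but the formula still assumes large sample sizes within each study.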

Chapter 3
Combining Results from Multiple Studies

3.1 Mean Differences

An estimate of the average study population mean difference ($\theta = m^{-1}\sum_{k=1}^{m} \theta_k$, where $\theta_k$ is the mean difference in study $k$) is

$$\hat{\theta} = m^{-1}\sum_{k=1}^{m} \hat{\theta}_k \tag{3.1}$$

and its estimated variance is

$$\widehat{\mathrm{var}}(\hat{\theta}) = m^{-2}\sum_{k=1}^{m} \widehat{\mathrm{var}}(\hat{\theta}_k) \tag{3.2}$$

where $\widehat{\mathrm{var}}(\hat{\theta}_k)$ is given in Equation 1.4 for two-group designs and Equation 1.8 for paired-samples designs.

An approximate two-sided $100(1 - \alpha)\%$ confidence interval for $\theta$ is

$$\hat{\theta} \pm z_{\alpha/2}\sqrt{\widehat{\mathrm{var}}(\hat{\theta})} \tag{3.3}$$

where $z_{\alpha/2}$ is a two-sided critical z-value. The small-sample accuracy of Equation 3.3 can be improved by replacing $z_{\alpha/2}$ with $t_{\alpha/2;df}$ where

$$df = \left[\sum_{k=1}^{m}\sum_{j=1}^{2} \frac{\hat{\sigma}_{jk}^2}{n_{jk}}\right]^2 \Big/ \left[\sum_{k=1}^{m}\sum_{j=1}^{2} \frac{\hat{\sigma}_{jk}^4}{n_{jk}^2 (n_{jk} - 1)}\right] \tag{3.4}$$

when all studies are two-group designs and

$$df = \left[\sum_{k=1}^{m} \frac{\hat{\sigma}_{dk}^2}{n_k}\right]^2 \Big/ \left[\sum_{k=1}^{m} \frac{\hat{\sigma}_{dk}^4}{n_k^2 (n_k - 1)}\right] \tag{3.5}$$

when all studies are paired-samples designs.

In most multi-study designs $\hat{\theta}$ will be computed using only two-group designs or only paired-samples designs. In some cases it could be appropriate to combine mean difference estimates from two-group and paired-samples designs. For instance, effect-size estimates from two-group treatment vs. placebo studies might be comparable with effect-size estimates from a pretest-posttest design; or effect-size estimates from two-group designs that compare two different treatments might be comparable to effect-size estimates from a within-subject experiment where participants receive both treatments in random order.

Example 3.1. Two eye-witness identification studies assessed participants' certainty in their selection of a suspect individual from a photo lineup after viewing a short video of a crime scene. Two treatment conditions were assessed in each study. In the first treatment condition the participants were told that the target individual "will be" in a 5-person photo lineup, and in the second treatment condition participants were told that the target individual "might be" in a 5-person photo lineup. The suspect was included in the lineup in both instruction conditions. Both studies used the same 1-10 certainty rating scale, and both studies sampled from study populations of introductory psychology students at large public universities. The summary information for the two studies is shown below.

Study   $n_{1k}$   $n_{2k}$   $\hat{\mu}_{1k}$   $\hat{\mu}_{2k}$   $\hat{\sigma}_{1k}$   $\hat{\sigma}_{2k}$   $\hat{\theta}_k$   $\sqrt{\widehat{\mathrm{var}}(\hat{\theta}_k)}$
1       40         40         7.4                6.3                1.7                   2.3                   1.1                0.452
2       20         20         6.9                5.7                1.5                   2.0                   1.2                0.559

Applying Equation 3.1 gives $\hat{\theta} = (1.1 + 1.2)/2 = 1.15$, and applying Equation 3.2 gives $\widehat{\mathrm{var}}(\hat{\theta}) = (0.452^2 + 0.559^2)/4 = 0.1292$. Computing Equation 3.4 gives $df = 79.9$ so that $t_{\alpha/2;df} = 1.99$. A 95% confidence interval for the average study population mean difference is $1.15 \pm 1.99\sqrt{0.1292} = [0.43, 1.87]$. We can be 95% confident that the mean certainty rating under a "will be" instruction is 0.43 to 1.87 greater than the mean certainty rating under a "might be" instruction in the two university study populations.

Example 3.2. Five studies used a pretest-posttest design and reported the effect of relaxation therapy on hours of migraine headaches per week.
All five studies sampled from study populations of adults from five large metropolitan areas who responded to a newspaper request for volunteers. The summary information for the five studies is shown below.

Study   $n_k$   $\hat{\mu}_{1k}$   $\hat{\mu}_{2k}$   $\hat{\sigma}_{1k}$   $\hat{\sigma}_{2k}$   $\hat{\rho}_k$   $\hat{\theta}_k$   $\sqrt{\widehat{\mathrm{var}}(\hat{\theta}_k)}$
1       45      20.1               10.4               9.3                   7.8                   .87              9.70               0.685
2       15      20.5               10.2               9.9                   8.0                   .92              10.30              1.042
3       20      19.3               8.5                10.1                  8.4                   .85              10.80              1.190
4       20      21.5               10.3               10.5                  8.1                   .90              11.20              1.067
5       30      19.4               7.8                9.8                   8.7                   .88              11.60              0.850

Applying Equation 3.1 gives $\hat{\theta} = (9.7 + 10.3 + \dots + 11.6)/5 = 10.72$, and applying Equation 3.2 gives $\widehat{\mathrm{var}}(\hat{\theta}) = (0.685^2 + 1.042^2 + \dots + 0.850^2)/25 = 0.1933$. Computing Equation 3.5 gives $df = 83.1$ so that $t_{\alpha/2;df} = 1.99$. A 95% confidence interval for the average study population mean difference is $10.72 \pm 1.99\sqrt{0.1933} = [9.85, 11.59]$. We can be 95% confident that the population mean migraine hours per week would be reduced by 9.85 to 11.59 hours if all members of the five study populations received the relaxation therapy.
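The calculations in Examples 3.1 and 3.2 can be reproduced from the summary statistics with a few lines of code. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the two within-study variance formulas are the usual unequal-variance and paired-samples forms (assumed here to correspond to Equations 1.4 and 1.8, since they reproduce the tabled standard errors), and the critical value 1.99 is taken from the text rather than computed. Because the code works from unrounded variances, intermediate values may differ from the text in the last reported digit.

```python
import math


def two_group_variance(sd1, n1, sd2, n2):
    """Estimated variance of a two-group mean difference (unequal-variance form)."""
    return sd1 ** 2 / n1 + sd2 ** 2 / n2


def paired_variance(sd1, sd2, r, n):
    """Estimated variance of a paired-samples mean difference."""
    return (sd1 ** 2 + sd2 ** 2 - 2.0 * r * sd1 * sd2) / n


def satterthwaite_df(terms):
    """Eqs. 3.4 and 3.5; terms is a list of (variance component, n - 1) pairs."""
    numerator = sum(v for v, _ in terms) ** 2
    denominator = sum(v ** 2 / d for v, d in terms)
    return numerator / denominator


def average_difference_ci(diffs, variances, t_crit):
    """Eq. 3.1, Eq. 3.2, and the interval of Eq. 3.3 with a supplied critical value."""
    m = len(diffs)
    est = sum(diffs) / m
    var = sum(variances) / m ** 2
    half_width = t_crit * math.sqrt(var)
    return est, var, (est - half_width, est + half_width)


# Example 3.1: (n1, n2, mean1, mean2, sd1, sd2) for each two-group study.
ex1 = [(40, 40, 7.4, 6.3, 1.7, 2.3), (20, 20, 6.9, 5.7, 1.5, 2.0)]
diffs1 = [m1 - m2 for _, _, m1, m2, _, _ in ex1]
vars1 = [two_group_variance(s1, n1, s2, n2) for n1, n2, _, _, s1, s2 in ex1]
terms1 = []
for n1, n2, _, _, s1, s2 in ex1:
    terms1 += [(s1 ** 2 / n1, n1 - 1), (s2 ** 2 / n2, n2 - 1)]
df1 = satterthwaite_df(terms1)
est1, var1, ci1 = average_difference_ci(diffs1, vars1, t_crit=1.99)

# Example 3.2: (n, mean1, mean2, sd1, sd2, r) for each paired-samples study.
ex2 = [(45, 20.1, 10.4, 9.3, 7.8, 0.87), (15, 20.5, 10.2, 9.9, 8.0, 0.92),
       (20, 19.3, 8.5, 10.1, 8.4, 0.85), (20, 21.5, 10.3, 10.5, 8.1, 0.90),
       (30, 19.4, 7.8, 9.8, 8.7, 0.88)]
diffs2 = [m1 - m2 for _, m1, m2, _, _, _ in ex2]
vars2 = [paired_variance(s1, s2, r, n) for n, _, _, s1, s2, r in ex2]
df2 = satterthwaite_df([(v, n - 1) for v, (n, *_rest) in zip(vars2, ex2)])
est2, var2, ci2 = average_difference_ci(diffs2, vars2, t_crit=1.99)

print("Example 3.1:", est1, var1, df1, ci1)
print("Example 3.2:", est2, var2, df2, ci2)
```

Passing the critical value as an argument keeps the sketch free of third-party dependencies; a statistics library could be used instead to obtain $t_{\alpha/2;df}$ from the computed degrees of freedom.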