International Journal of Epidemiology
© International Epidemiological Association 1996
Vol. 25, No. 2
Printed in Great Britain

Matched-Pair Case-Control Studies when Risk Factors are Correlated within the Pairs

BETH C GLADEN

Gladen B C (Statistics and Biomathematics Branch, Mail Drop A3-03, National Institute of Environmental Health Sciences, PO Box 12233, Research Triangle Park, NC 27709, USA). Matched-pair case-control studies when risk factors are correlated within the pairs. International Journal of Epidemiology 1996; 25:

Background. If pair members are independent, simple matched-pair case-control studies are known to yield consistent estimates of the population odds ratio. If pair members are not independent, this is not necessarily true. It has been shown previously that the usual matched-pair estimate remains consistent if the exposure of interest is correlated within the pairs. However, the effect of correlation of unmeasured risk factors within the pairs has not been studied.

Methods. We examine the effect of within-pair correlation of unmeasured risk factors independent of the measured exposure. This is done within the context of a simple matched-pair case-control study. We compare the large-sample expectation of the usual matched-pair estimate to the population odds ratio.

Results. We show that the usual estimate may be inconsistent in the presence of this correlation. However, if the disease is rare, the magnitude of the bias will be negligible.

Conclusions. Correlation of unmeasured risk factors independent of the measured exposure is not a practical problem in this setting.

Keywords: bias (epidemiology), odds ratio, selection bias, epidemiological methods

Matched-pair case-control studies can be used to study the relationship between a disease and an exposure of interest. In the simple version of such a study, we choose a random sample of cases and a matched control for each case. We determine whether each pair member is exposed.
We calculate the ratio of the number of pairs with an exposed case and an unexposed control to the number of pairs with an exposed control and an unexposed case. Under the usual assumptions, this ratio will be a consistent (that is, unbiased in large samples) estimate of the population odds ratio.

One of the usual assumptions is that everyone in the study is independent. If controls are chosen as, for example, random people from the same city and of the same age and sex as the case, this may be a reasonable assumption. If, however, the controls are siblings or spouses of the cases, the assumption of independence within pairs becomes less tenable. Such controls may be used because they are considered more appropriate; they may also be used for the practical reason that they are readily identified and likely to be willing to participate in a study. Since using these types of controls violates the usual assumptions, we need to check the behaviour of the estimate under these conditions.

Goldstein, Hodge, and Haile looked at simple matched-pair case-control studies where the exposure of interest is correlated within the pairs.¹ In retrospective studies, what we are examining is the distribution of exposure. When exposure of a control is related to exposure of a case, it is reasonable to think this correlation may distort our inferences. However, Goldstein et al. demonstrated that the usual matched-pair estimate remains a consistent estimate of the population odds ratio despite the correlation of exposure (assuming no other assumptions are violated). A similar result appears in Pike and Robins² in a modification of the results of Flanders and Austin.³

However, this is not completely reassuring since the correlation within pairs may well extend further.
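As a minimal illustration of the usual matched-pair estimate described above, the following Python sketch counts the two kinds of exposure-discordant pairs and returns their ratio. The function name `matched_pair_odds_ratio` and the representation of each pair as a `(case_exposed, control_exposed)` tuple are assumptions made for this illustration; they do not come from the paper.

```python
from typing import Iterable, Tuple


def matched_pair_odds_ratio(pairs: Iterable[Tuple[bool, bool]]) -> float:
    """Usual matched-pair estimate of the odds ratio.

    Each pair is (case_exposed, control_exposed). Only pairs that are
    discordant for exposure contribute to the estimate.
    """
    exposed_case_only = 0      # exposed case, unexposed control
    exposed_control_only = 0   # unexposed case, exposed control
    for case_exposed, control_exposed in pairs:
        if case_exposed and not control_exposed:
            exposed_case_only += 1
        elif control_exposed and not case_exposed:
            exposed_control_only += 1
    if exposed_control_only == 0:
        raise ValueError("no pairs with an exposed control and an unexposed case")
    return exposed_case_only / exposed_control_only


# Hypothetical data: 30 + 10 discordant pairs, 12 concordant pairs.
example_pairs = (
    [(True, False)] * 30
    + [(False, True)] * 10
    + [(True, True)] * 5
    + [(False, False)] * 7
)
print(matched_pair_odds_ratio(example_pairs))  # 3.0
```

The concordant pairs are ignored by construction, which is why only the two discordant counts appear in the ratio.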
Although we are only interested in and only measure a single exposure, there are always other risk factors for the disease. The pair members may well be correlated on these other risk factors as well. These other risk factors may not be recognized, let alone measured. For example, suppose we are studying the relationship between a disease and some exposure; if the disease is thought to have a genetic component, but the genes responsible are unknown, sibling controls may be used.

The genetic risk factor cannot be measured, since the gene is unknown. The siblings may have correlated values of a variety of other unmeasured risk factors as well; these might include diet or socioeconomic status. Similarly, if a disease is known to vary by socioeconomic status or geographical location, neighbourhood controls may be used. The underlying risk factors may be unknown and thus unmeasured, and may be correlated within neighbourhoods. Neighbourhood is not a risk factor in itself, but a surrogate for these other risk factors.

In this paper, we examine whether the usual estimate in a simple matched-pair case-control study remains consistent if correlation of risk factors (both the single measured one and the unmeasured ones) is present within pairs. Throughout, we will ignore precision; we are only concerned with bias. We will also assume that the unmeasured risk factors are independent of the measured exposure. Dependence would create a standard confounding situation where bias would be expected; under independence, one might expect to avoid problems. We explore whether this expectation is accurate.

ASSUMPTIONS AND NOTATION

Validity of a matched study depends on the rules which specify which non-cases are potential matched controls for each case. Certain schemes, such as use of friend controls, can cause bias.²⁻⁵ This bias is avoided if the population from which cases arise can be divided into non-overlapping groups, and controls are chosen from the same group as the case; this has been called 'reciprocal design'.²⁻⁵ We assume throughout that controls are chosen in this fashion. These non-overlapping groups might consist, for example, of sibship members or of residents of the same city block. For concreteness, we will assume that the groups in question are pairs, and we will call the pair members the wife and the husband.

Assume a single dichotomous exposure of interest, denoted by $E$.
This exposure will be the focus of the matched-pair case-control study. Let $p$ and $q$ denote the prevalences of the exposure for wives and husbands, respectively; we need not assume that they are equal. Assume that exposures of wife and husband are correlated, and let $r$ denote the covariance. Writing $\bar{E}$ for absence of the exposure, the joint probabilities of $E$ are:

$$\begin{aligned}
\Pr(\text{wife is } E,\ \text{husband is } E) &= pq + r \\
\Pr(\text{wife is } E,\ \text{husband is } \bar{E}) &= p(1-q) - r \\
\Pr(\text{wife is } \bar{E},\ \text{husband is } E) &= (1-p)q - r \\
\Pr(\text{wife is } \bar{E},\ \text{husband is } \bar{E}) &= (1-p)(1-q) + r
\end{aligned}$$

If $r = 0$, the exposures of wife and husband are independent.

Assume one other discrete risk factor (denoted $F$) with $f$ categories. $F$ can be thought of as subsuming all other risk factors, since it could actually be a composite of multiple, possibly dependent, risk factors; for example, level 1 is young white professionals, level 2 is old white professionals, level 3 is young white labourers, and so on. $F$ will not be measured in the matched-pair study; it nevertheless plays a role in determining the distribution of disease in the pairs. Assume that $F$ is correlated within pairs, but that $F$ and $E$ are independent. Assume that prevalences of $F$ are the same for wives and husbands. Denote the marginal and joint probabilities for $F$ by:

$$\begin{aligned}
\Pr(\text{wife is } F_i) &= \Pr(\text{husband is } F_i) = x_i \\
\Pr(\text{wife is } F_i \text{ and husband is } F_j) &= \Pr(\text{wife is } F_j \text{ and husband is } F_i) = x_i x_j + z_{ij}
\end{aligned}$$

If $z_{ij} = 0$ for all values of $i$ and $j$, then the risk factors of husband and wife are independent.

Finally, denote disease by $D$. Assume that disease risk depends only on $E$ and $F$. In particular, assume that variations in disease risk from one pair to another are attributable solely to variations in $E$ and $F$. Assume that, conditional on the risk factors, occurrence of disease in one individual is independent of occurrence in all others.
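The within-pair distributions of $E$ and $F$ defined above can be written down directly. The sketch below is only an illustration under the stated notation: the function names `exposure_joint` and `risk_factor_joint`, the dictionary and nested-list layouts, and the validity checks are assumptions introduced here, not part of the paper.

```python
import math


def exposure_joint(p, q, r):
    """Joint distribution of (wife exposed, husband exposed).

    p, q are the exposure prevalences for wives and husbands and r is the
    within-pair covariance of exposure.
    """
    joint = {
        (True, True): p * q + r,
        (True, False): p * (1 - q) - r,
        (False, True): (1 - p) * q - r,
        (False, False): (1 - p) * (1 - q) + r,
    }
    if any(prob < 0 for prob in joint.values()):
        raise ValueError("r is outside the range allowed by p and q")
    return joint


def risk_factor_joint(x, z):
    """Joint distribution of (wife's F level, husband's F level).

    x[i] is the common marginal probability of level i; z[i][j] is the
    departure from independence, so the joint entry is x[i]*x[j] + z[i][j].
    """
    n = len(x)
    for i in range(n):
        # Each row of z must sum to zero so that the marginals stay equal to x.
        if abs(sum(z[i])) > 1e-12:
            raise ValueError("each row of z must sum to zero")
        for j in range(n):
            if not math.isclose(z[i][j], z[j][i], abs_tol=1e-12):
                raise ValueError("z must be symmetric")
    joint = [[x[i] * x[j] + z[i][j] for j in range(n)] for i in range(n)]
    if any(entry < 0 for row in joint for entry in row):
        raise ValueError("z produces a negative joint probability")
    return joint


# Hypothetical parameter values, chosen only to show the calls.
e_joint = exposure_joint(0.3, 0.4, 0.05)
f_joint = risk_factor_joint([0.6, 0.4], [[0.1, -0.1], [-0.1, 0.1]])
print(sum(e_joint.values()), sum(sum(row) for row in f_joint))
```

Both tables should sum to 1 up to floating-point rounding, and their margins should reproduce $p$, $q$ and the $x_i$ regardless of $r$ and $z_{ij}$.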
Denote the disease probabilities by:

$$\Pr(D \mid \bar{E}, F_i) = a_i, \qquad \Pr(D \mid E, F_i) = b_i$$

Note that we do not assume that relative risks for $E$ (that is, $b_i/a_i$) are constant across the levels of $F$; this means effect modification is permitted. Thus, for example, we allow for the possibility that one factor is environmental, the other is genetic, and no elevation in risk occurs unless both are present.

RESULTS

Population Parameters

We may derive the population values for relative risks and odds ratios for exposure through straightforward algebra; details are in the Appendix. First, we may show that the risk of disease conditional on exposure is

$$\Pr(D \mid E) = \sum_{i=1}^{f} x_i b_i.$$

This is, of course, just a weighted average of the risks ($b_i$) in the various levels of $F$, weighted by the frequencies ($x_i$). Similarly, we may show that $\Pr(D \mid \bar{E}) = \sum_{i=1}^{f} x_i a_i$. Then the relative risk is

$$\sum_{i=1}^{f} x_i b_i \Big/ \sum_{i=1}^{f} x_i a_i.$$

A similar expression in the case where $f = 2$ is given by Khoury and James.⁶ Similarly, the odds ratio is:

$$\frac{\left[\sum_{i=1}^{f} x_i b_i\right]\left[\sum_{i=1}^{f} x_i (1-a_i)\right]}{\left[\sum_{i=1}^{f} x_i a_i\right]\left[\sum_{i=1}^{f} x_i (1-b_i)\right]} \tag{1}$$

Note that the correlation parameters, $r$ and $z_{ij}$, do not enter into these expressions; the relative risk and odds ratios are the same whether or not exposures are correlated within the pairs.

Matched Pair Estimate

Suppose we do a matched-pair case-control study looking at the effect of $E$ on $D$. $F$ is not measured in such a study, but it affects the distribution of $D$ nonetheless. By design, only those pairs discordant for $D$ (that is, pairs with one case and one control) appear in the study. Of those, only those pairs discordant for $E$ contribute to the usual estimate of the odds ratio. The expected number of pairs with an exposed case and an unexposed control will be proportional to:

$$\Pr(\text{wife is } E, D \text{ and husband is } \bar{E}, \bar{D}) + \Pr(\text{wife is } \bar{E}, \bar{D} \text{ and husband is } E, D)$$

which can be shown to be:

$$[p + q - 2pq - 2r]\left\{\left[\sum_{i=1}^{f} x_i (1-a_i)\right]\left[\sum_{i=1}^{f} x_i b_i\right] - \sum_{i=1}^{f}\sum_{j=1}^{f} z_{ij} b_i a_j\right\}$$

Similarly, the expected number of pairs with an unexposed case and an exposed control will be proportional to:

$$\Pr(\text{wife is } \bar{E}, D \text{ and husband is } E, \bar{D}) + \Pr(\text{wife is } E, \bar{D} \text{ and husband is } \bar{E}, D) = [p + q - 2pq - 2r]\left\{\left[\sum_{i=1}^{f} x_i a_i\right]\left[\sum_{i=1}^{f} x_i (1-b_i)\right] - \sum_{i=1}^{f}\sum_{j=1}^{f} z_{ij} a_j b_i\right\}$$

The ratio of these two terms gives the expression for the large-sample expectation of the estimated odds ratio:

$$\frac{\left[\sum_{i=1}^{f} x_i b_i\right]\left[\sum_{i=1}^{f} x_i (1-a_i)\right] - \sum_{i=1}^{f}\sum_{j=1}^{f} z_{ij} b_i a_j}{\left[\sum_{i=1}^{f} x_i a_i\right]\left[\sum_{i=1}^{f} x_i (1-b_i)\right] - \sum_{i=1}^{f}\sum_{j=1}^{f} z_{ij} b_i a_j} \tag{2}$$

Clearly, this expression differs from the population odds ratio (1); specifically, it has an extra term subtracted from both numerator and denominator. Unlike the population odds ratio, the estimate is affected by the correlation parameters $z_{ij}$. Thus, the usual matched-pairs estimate will be biased. We now examine the nature of the bias.

Behaviour of Estimate Under Various Conditions

First note that the distribution of the exposure $E$ is irrelevant to the behaviour of the estimate; $p$, $q$, and $r$ do not appear in expression (2).

Expression (2) will be equal to the population odds ratio (1) in several circumstances. First, if the exposure of interest $E$ is not actually a risk factor, there is no bias. This condition is equivalent to $a_i = b_i$ for all $i$. In this situation, both the population odds ratio and the large-sample expectation of the estimate will be 1.

Second, if there is no correlation on $F$ within pairs, there is no bias. This condition is equivalent to $z_{ij} = 0$ for all $i$ and $j$.

Third, if $F$ fails to be a risk factor within at least one of the two exposure groups, there is no bias. If $F$ does not affect disease risk among the unexposed, then $a_i = a$ for all $i$; it can be shown that there is no bias. In similar fashion, if $F$ does not affect disease risk among the exposed, then $b_i = b$ for all $i$, and there will be no bias. Note that the case studied by Goldstein et al.¹ had no risk factor $F$, which is equivalent to having both $a_i = a$ and $b_i = b$; thus their results are a special case of the results obtained here.

The behaviour of expression (2) in the rare disease case can be seen by letting disease rates go to zero with relative rates fixed. Simple calculus shows that the limit is the relative risk. Thus as the disease becomes rarer, both the large-sample expectation (2) and the population odds ratio (1) approach the population relative risk and the bias disappears.

Example

Suppose that $F$ is dichotomous. The distribution of $F$ can then be described by only two parameters, due to constraints mentioned in the Appendix. Thus, we have:

$$\begin{aligned}
\Pr(\text{wife is } F_1) &= \Pr(\text{husband is } F_1) = x_1 \\
\Pr(\text{wife is } F_2) &= \Pr(\text{husband is } F_2) = 1 - x_1 \\
\Pr(\text{wife is } F_1 \text{ and husband is } F_1) &= x_1^2 + z_{11} \\
\Pr(\text{wife is } F_1 \text{ and husband is } F_2) &= \Pr(\text{wife is } F_2 \text{ and husband is } F_1) = x_1(1 - x_1) - z_{11} \\
\Pr(\text{wife is } F_2 \text{ and husband is } F_2) &= (1 - x_1)^2 + z_{11}
\end{aligned}$$

There will be four disease parameters ($a_1$, $a_2$, $b_1$, $b_2$). Assume that $F_2$ is the higher risk category for both exposed and unexposed, so that $a_2 \ge a_1$ and $b_2 \ge b_1$.
Assume also that exposure is detrimental in both categories of $F$, so that $b_1 \ge a_1$ and $b_2 \ge a_2$. Assuming all this, we conducted a numerical search through the region where disease risks ($a_1$, $a_2$, $b_1$, $b_2$) are small ($10^{-6}$ to 0.1) and relative risks are moderate (1 to 5). The parameters $x_1$ and $z_{11}$ were allowed to range through all possible values. The search yielded no example where expression (2) differed by more than 3% from the population odds ratio. In this particular case, expression (2) is an increasing function of $z_{11}$. Thus positive correlation within the pair ($z_{11} > 0$) will produce a value for expression (2) greater than the population odds ratio. Conversely, negative correlation will produce a value which is smaller.

DISCUSSION

We have shown that correlation within matched case-control pairs on unmeasured risk factors independent of the measured exposure can cause the usual estimate to be inconsistent for the population odds ratio. The bias vanishes as disease becomes rare; thus the bias is unlikely to be of practical importance. There is no bias if the exposure of interest is not a risk factor. There is also no bias if the unmeasured risk factor is not truly a risk factor or if it is not correlated within pairs.

We assume throughout that the quantity of interest is the population odds ratio. This will not always be the situation. For example, if the unmeasured risk factor is genotype, only the risk among the susceptibles may be of interest.⁷,⁸ We assume that disease is independent within pairs, conditional on the risk factors; for non-infectious diseases, this is likely to be true since any correlation of disease is probably induced by correlation of risk factors. We assumed that the marginal distribution of unmeasured risk factors was the same for the two pair members; situations where this is not true (for example, spouses of breast cancer cases) are likely to represent problematic choices of controls.

Related but different problems have been discussed by other authors. Khoury and James⁶ assume a measured environmental factor and an unmeasured genetic factor, but examine a different study design. They identify affected individuals and determine the disease status of the pair member.
They calculate risk of disease in one pair member conditional on the other pair member being diseased and conditional on exposure status. In contrast, the matched-pair case-control study examined here looks at risk of exposure in a pair conditional on the disease status of the pair. They show that the relative risks they obtain will equal the population relative risk if risks are multiplicative.

Robins and Pike⁵ discuss the situation of two risk factors in matched-pair case-control studies. However, they assume that both risk factors $E$ and $F$ are measured and the effects of both are estimated simultaneously. This is a different estimator from the one discussed here. They assume that $E$ and $F$ are correlated with each other. They show that if risks are multiplicative, the estimates for both risk factors will be unbiased.

ACKNOWLEDGEMENTS

I thank Dale Sandler for bringing this problem to my attention and Glinda Cooper, Dale Sandler, David Umbach, and Clarice Weinberg for helpful comments.

REFERENCES

1 Goldstein A M, Hodge S E, Haile R W C. Selection bias in case-control studies using relatives as the controls. Int J Epidemiol 1989; 18:
2 Pike M C, Robins J. Re: 'Possibility of selection bias in matched case-control studies using friend controls'. Am J Epidemiol 1989; 130:
3 Flanders W D, Austin H. Possibility of selection bias in matched case-control studies using friend controls. Am J Epidemiol 1986; 124:
4 Austin H, Flanders W D, Rothman K J. Bias arising in case-control studies from selection of controls from overlapping groups. Int J Epidemiol 1989; 18:
5 Robins J, Pike M. The validity of case-control studies with nonrandom selection of controls. Epidemiology 1990; 1:
6 Khoury M J, James L M. Population and familial relative risks of disease associated with environmental factors in the presence of gene-environment interaction. Am J Epidemiol 1993; 137:
7 Khoury M J, Stewart W, Beaty T H. The effect of genetic susceptibility on causal inference in epidemiologic studies. Am J Epidemiol 1987; 126:
8 Breitner J C S, Murphy E A, Woodbury M A. Case-control studies of environmental influences in diseases with genetic determinants, with an application to Alzheimer's disease. Am J Epidemiol 1991; 133:

(Revised version received August 1995)

APPENDIX

We give here details of some of the calculations. Sums over $i$ and $j$ run over the $f$ levels of $F$. Note first for future reference that symmetry in the definitions of the probabilities of $F$ implies that $z_{ij} = z_{ji}$. The fact that probabilities add to 1 implies that $\sum_i x_i = 1$. The definition of $x_i$ implies that

$$\sum_i z_{ij} = \sum_j z_{ij} = 0.$$

First, derive the risk of disease conditional on exposure:

$$\Pr(D \mid E) = \sum_i \Pr(E)\,\Pr(F_i)\,\Pr(D \mid E, F_i) \,/\, \Pr(E) = \sum_i x_i b_i$$

The derivation of $\Pr(D \mid \bar{E})$ is exactly analogous; population values of relative risks and odds ratios follow immediately.

The expected number of pairs with an exposed case and an unexposed control will be proportional to:

$$\begin{aligned}
&\Pr(\text{wife is } E, D \text{ and husband is } \bar{E}, \bar{D}) + \Pr(\text{wife is } \bar{E}, \bar{D} \text{ and husband is } E, D) \\
&= \sum_i \sum_j \big[\Pr(\text{wife is } E, D, F_i \text{ and husband is } \bar{E}, \bar{D}, F_j) + \Pr(\text{wife is } \bar{E}, \bar{D}, F_i \text{ and husband is } E, D, F_j)\big] \\
&= \sum_i \sum_j \big[\Pr(\text{wife is } E \text{ and husband is } \bar{E})\,\Pr(\text{wife is } F_i \text{ and husband is } F_j)\,\Pr(\text{wife is } D \mid \text{wife is } E, F_i)\,\Pr(\text{husband is } \bar{D} \mid \text{husband is } \bar{E}, F_j) \\
&\qquad\quad + \Pr(\text{wife is } \bar{E} \text{ and husband is } E)\,\Pr(\text{wife is } F_i \text{ and husband is } F_j)\,\Pr(\text{wife is } \bar{D} \mid \text{wife is } \bar{E}, F_i)\,\Pr(\text{husband is } D \mid \text{husband is } E, F_j)\big] \\
&= \sum_i \sum_j \big\{[p(1-q) - r](x_i x_j + z_{ij})\, b_i (1-a_j) + [(1-p)q - r](x_i x_j + z_{ij})(1-a_i)\, b_j\big\} \\
&= [p(1-q) - r] \sum_i x_i b_i \sum_j x_j (1-a_j) + [(1-p)q - r] \sum_i x_i (1-a_i) \sum_j x_j b_j \\
&\qquad + [p(1-q) - r] \sum_i \sum_j z_{ij} b_i (1-a_j) + [(1-p)q - r] \sum_i \sum_j z_{ij} (1-a_i) b_j \\
&= [p + q - 2pq - 2r]\Big[\sum_i x_i (1-a_i)\Big]\Big[\sum_i x_i b_i\Big] + [p(1-q) - r] \sum_i \sum_j z_{ij} b_i (1-a_j) + [(1-p)q - r] \sum_i \sum_j z_{ji} b_j (1-a_i) \\
&= [p + q - 2pq - 2r]\Big\{\Big[\sum_i x_i (1-a_i)\Big]\Big[\sum_i x_i b_i\Big] + \sum_i \sum_j z_{ij} b_i (1-a_j)\Big\} \\
&= [p + q - 2pq - 2r]\Big\{\Big[\sum_i x_i (1-a_i)\Big]\Big[\sum_i x_i b_i\Big] + \sum_i b_i \sum_j z_{ij} - \sum_i \sum_j z_{ij} b_i a_j\Big\} \\
&= [p + q - 2pq - 2r]\Big\{\Big[\sum_i x_i (1-a_i)\Big]\Big[\sum_i x_i b_i\Big] - \sum_i \sum_j z_{ij} b_i a_j\Big\}
\end{aligned}$$

The expected number of pairs with an unexposed case and an exposed control can be derived similarly, and expression (2) follows immediately.

Expression (2) will be equal to the population odds ratio (1) in several circumstances. First, there is no bias if $a_i = b_i$ for all $i$. Under this assumption, the numerator of (2) equals the denominator of (2), so expression (2) equals 1. Since the population odds ratio is also 1, there is no bias.

Second, there is no bias if $z_{ij} = 0$ for all $i$ and $j$. Under this condition, the extra term in the numerator and denominator of (2) is zero; this makes expressions (1) and (2) exactly equal.

Third, there is no bias if $a_i = a$ for all $i$. Under these circumstances, the extra term is again zero:

$$\sum_i \sum_j z_{ij} b_i a_j = a \sum_i b_i \sum_j z_{ij} = 0.$$

Thus there is no bias. In similar fashion, if $b_i = b$ for all $i$, the extra term is again zero.
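The three no-bias conditions can be checked numerically. The sketch below evaluates the population odds ratio (1) and the large-sample expectation (2) for a dichotomous $F$; it is an illustration only. The function names and all parameter values (`X`, `Z`, `A`, `B`) are assumptions chosen for this example and are not taken from the paper.

```python
import math


def population_odds_ratio(x, a, b):
    """Expression (1): odds ratio built from Pr(D|E) and Pr(D|not E)."""
    risk_exposed = sum(xi * bi for xi, bi in zip(x, b))
    risk_unexposed = sum(xi * ai for xi, ai in zip(x, a))
    # sum_i x_i (1 - a_i) equals 1 - risk_unexposed because the x_i sum to 1.
    return (risk_exposed * (1 - risk_unexposed)) / (
        risk_unexposed * (1 - risk_exposed)
    )


def matched_pair_expectation(x, z, a, b):
    """Expression (2): large-sample expectation of the matched-pair estimate."""
    n = len(x)
    risk_exposed = sum(xi * bi for xi, bi in zip(x, b))
    risk_unexposed = sum(xi * ai for xi, ai in zip(x, a))
    # Extra term subtracted from both numerator and denominator.
    extra = sum(z[i][j] * b[i] * a[j] for i in range(n) for j in range(n))
    numerator = risk_exposed * (1 - risk_unexposed) - extra
    denominator = risk_unexposed * (1 - risk_exposed) - extra
    return numerator / denominator


# Hypothetical dichotomous F with positive within-pair correlation.
X = [0.6, 0.4]
Z = [[0.1, -0.1], [-0.1, 0.1]]
A = [0.01, 0.05]   # risks among the unexposed
B = [0.02, 0.20]   # risks among the exposed
NO_CORRELATION = [[0.0, 0.0], [0.0, 0.0]]

print("general case:", population_odds_ratio(X, A, B),
      matched_pair_expectation(X, Z, A, B))
print("a_i = b_i:", matched_pair_expectation(X, Z, A, A))
print("z_ij = 0:", matched_pair_expectation(X, NO_CORRELATION, A, B))
print("a_i = a:", population_odds_ratio(X, [0.03, 0.03], B),
      matched_pair_expectation(X, Z, [0.03, 0.03], B))
```

In the general case the two quantities differ, with expression (2) the larger because the correlation here is positive; in each of the three special cases they agree up to floating-point rounding.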

The behaviour of expression (2) in the rare disease case can be seen by letting disease rates go to zero with relative rates fixed. Let $b_i = t_i a_i$ and $a_i = s_i a_0$. Expression (2) becomes

$$\frac{\left[\sum_i x_i t_i s_i\right]\left[\sum_i x_i\right] - a_0 \left[\sum_i x_i t_i s_i\right]\left[\sum_i x_i s_i\right] - a_0 \sum_i \sum_j z_{ij} t_i s_i s_j}{\left[\sum_i x_i s_i\right]\left[\sum_i x_i\right] - a_0 \left[\sum_i x_i s_i\right]\left[\sum_i x_i t_i s_i\right] - a_0 \sum_i \sum_j z_{ij} t_i s_i s_j}$$

We need the limit of this expression as $a_0$ goes to zero with all other terms fixed. Simple calculus shows that the limit is:

$$\left[\sum_i x_i t_i s_i\right] \Big/ \left[\sum_i x_i s_i\right] = \left[\sum_i x_i b_i\right] \Big/ \left[\sum_i x_i a_i\right] = \text{relative risk}$$

Thus as the disease becomes rare, the large-sample expectation approaches the population relative risk.
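The rare-disease limit can also be seen numerically by shrinking the baseline risk $a_0$ while holding the relative rates fixed. The following sketch is an illustration under assumed values: the scale factors `S` and `T`, the distribution `X`, `Z`, the helper names, and the grid of $a_0$ values are all hypothetical choices, not taken from the paper.

```python
import math


def relative_risk(x, a, b):
    """Population relative risk: sum_i x_i b_i / sum_i x_i a_i."""
    return sum(xi * bi for xi, bi in zip(x, b)) / sum(
        xi * ai for xi, ai in zip(x, a)
    )


def population_odds_ratio(x, a, b):
    """Expression (1)."""
    risk_exposed = sum(xi * bi for xi, bi in zip(x, b))
    risk_unexposed = sum(xi * ai for xi, ai in zip(x, a))
    return (risk_exposed * (1 - risk_unexposed)) / (
        risk_unexposed * (1 - risk_exposed)
    )


def matched_pair_expectation(x, z, a, b):
    """Expression (2)."""
    n = len(x)
    risk_exposed = sum(xi * bi for xi, bi in zip(x, b))
    risk_unexposed = sum(xi * ai for xi, ai in zip(x, a))
    extra = sum(z[i][j] * b[i] * a[j] for i in range(n) for j in range(n))
    return (risk_exposed * (1 - risk_unexposed) - extra) / (
        risk_unexposed * (1 - risk_exposed) - extra
    )


# Hypothetical dichotomous F with positive within-pair correlation.
X = [0.6, 0.4]
Z = [[0.1, -0.1], [-0.1, 0.1]]
S = [1.0, 5.0]   # a_i = s_i * a0
T = [2.0, 4.0]   # b_i = t_i * a_i


def scaled_risks(a0):
    """Return (a, b) for baseline risk a0 with relative rates held fixed."""
    a = [s * a0 for s in S]
    b = [t * s * a0 for t, s in zip(T, S)]
    return a, b


for a0 in (1e-2, 1e-3, 1e-4, 1e-5):
    a, b = scaled_risks(a0)
    print(
        a0,
        relative_risk(X, a, b),
        population_odds_ratio(X, a, b),
        matched_pair_expectation(X, Z, a, b),
    )
```

The relative risk stays constant down the table because $a_0$ cancels from it, while both the odds ratio (1) and the expectation (2) move towards it as $a_0$ shrinks.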

