Logistic Regression Model for Analyzing Extended Haplotype Data


Genetic Epidemiology 15: (1998)

Sylvan Wallenstein,1* Susan E. Hodge,3 and Ainsley Weston2

1 Department of Biomathematical Sciences, Mount Sinai School of Medicine, New York, New York
2 Department of Community Medicine, Mount Sinai School of Medicine, New York, New York
3 New York State Psychiatric Institute and Department of Psychiatry, Columbia University, New York, New York

Recently, there has been increased interest in evaluating extended haplotypes in p53 as risk factors for cancer. An allele-specific polymerase chain reaction (PCR) method, confirmed by restriction analysis, has been used to determine absolute extended haplotypes in diploid genomes. We describe statistical analyses for comparing cases and controls, or comparing different ethnic groups, with respect to haplotypes composed of several biallelic loci, especially in the presence of other covariates. Tests based on cross-tabulating all possible genotypes by disease state can have limited power due to the large number of possible genotypes. Tests based simply on cross-tabulating all possible haplotypes by disease state cannot be extended to account for other variables measured on the individual. We propose imposing an assumption of additivity upon the haplotype-based analysis. This yields a logistic regression in which the outcome is case or control, and the predictor variables include the number of copies (0, 1, or 2) of each haplotype, as well as other explanatory variables. In a case-control study, the model can be constructed so that each coefficient gives the log odds ratio for disease for an individual with a single copy of the suspect haplotype and another copy of the most common haplotype, relative to an individual with two copies of the most common haplotype. We illustrate the method with published data on p53 and breast cancer. The method can also be applied to any polymorphic system, whether multiple alleles at a single locus or multiple haplotypes over several loci. Genet. Epidemiol. 15: , Wiley-Liss, Inc.

*Correspondence to: Sylvan Wallenstein, Box 1023, Department of Biomathematical Sciences, Mount Sinai School of Medicine, 1 Gustave Levy Place, New York, NY wallenst@msvax.mssm.edu

Received 10 April 1997; Revised 18 June 1997; Accepted 19 June

Key words: case-control studies; association studies; disease-marker associations; HLA

INTRODUCTION

Recently, there has been increased interest in evaluating haplotypes rather than alleles at individual loci as risk factors for disease. For example, the inheritance of three biallelic polymorphisms of p53 has been proposed as a risk factor for various cancers, including colorectal cancer [Själander et al., 1995a] and breast cancer [Själander et al., 1996]. Själander et al. [1995b] conclude their paper by noting that "... extended haplotypes would be more informative [than individual alleles at individual loci] in studies of population differences and associations between p53 germline mutations and cancer." In these reports, the frequencies of the three-locus extended haplotypes were not physically measured. They were obtained by first estimating the frequency of pairwise haplotypes for each of the three combinations using the frequency distribution in homozygotes, and then estimating three-way combinations from these two-way frequencies.

Recently, Weston et al. [1997] developed an allele-specific PCR protocol, confirmed by restriction analysis, to directly determine absolute extended p53 haplotypes in diploid genomes. They compared breast cancer cases with controls in three different ethnic groups with respect to these extended haplotypes. We propose a logistic regression procedure for evaluating differences between two or more groups with respect to extended haplotype distributions and show how this has advantages over simple cross-tabulation of genotypes or haplotypes.
The methodology is similar to that described by Smouse and Williams [1982] in a slightly different context, but the current paper also describes optimality properties of the test, discusses estimation in addition to testing, and allows for the presence of additional predictor variables, thus paving the way for studies of gene-environment interactions. We discuss application to both cohort studies (e.g., different ethnic groups) and case-control studies.

METHODS

We denote a haplotype based on $m$ biallelic marker loci as $x$, where each $x$ is a row vector of length $m$ composed of 1's and 2's. For concreteness and ease of exposition, we shall let 1 denote the more common allele, or major allele, at each locus. Thus, for example, for $m = 3$, there are eight possible haplotypes: (1-1-1, 1-1-2, 1-2-1, 1-2-2, 2-1-1, 2-1-2, 2-2-1, 2-2-2). Let $H$ denote the number of haplotypes under consideration. The maximum value of $H$ is $H = 2^m$; however, in a particular example $H$ may be smaller than $2^m$, either because not all haplotypes occur biologically or are observed, or because we choose to combine two or more rarely observed haplotypes into a category designated "other."

We denote the genotype, i.e., the pair of haplotypes for an individual, as $(x;y)$. For example, when $m = 3$, an individual with the 1-1-2 and 1-2-1 haplotypes would be denoted $x = (1,1,2)$, $y = (1,2,1)$. The number of genotypes under consideration is denoted by $G$. The maximum value of $G$, given $H$, is $G = H(H + 1)/2$, but a smaller number may occur either because a certain genotype is lethal, or because it did not happen to occur in the observed sample. In general, $G$ can be quite large, taking a maximum value of $2^{m-1}(2^m + 1)$; but in particular applications it can be appreciably smaller, though still considerably larger than $H$.

We first describe two cross-tabulation procedures for analyzing these data, one genotype based and the other haplotype based. We then present our proposed logistic regression method.

The genotype-based cross-tabulation method is based on a $G \times 2$ table recording the $n$ patients according to genotype and disease status (case or control). In this unstructured model, no relationship is postulated among the $G$ different values of $\pi(x;y) = P(\text{Disease} \mid \text{haplotype } x \text{ and haplotype } y)$. The null hypothesis, $H_{0G}$: $\pi(x;y)$ = constant, can be tested by the ordinary Pearson chi-square statistic with $G - 1$ degrees of freedom. Additional variables, such as smoking in lung cancer studies, could be added to the model using logistic regression [Sugimura et al., 1994]. However, especially for $m > 2$, due to the large value of $G$, this $\chi^2$ statistic could lack power if the difference between groups is due to a haplotype rather than just one genotype.

Therefore we discuss two haplotype-based procedures, one based on a cross-tabulation and the other on logistic regression, that reduce the number of parameters involved in the test of no genetic effect from $G - 1$ to $H - 1$. Both these procedures test the same hypothesis $H_{0G}$ as above but, implicitly or explicitly, make an assumption of additivity on some scale. To apply these asymptotic procedures, we would strive to retain in the analysis haplotypes that occurred in a minimum of five individuals, and combine all the rest into an "other" group, which should itself also contain at least five individuals.

The first haplotype-based procedure, performed for example by Weston et al. [1997], tabulates the total number of times each haplotype occurs in the sample, counting each haplotype twice for homozygotes and once for heterozygotes, for a total of $2n$ haplotypes in a sample of $n$ individuals. (This is the method of allele counting; see, e.g., Cavalli-Sforza and Bodmer [1971] or Edwards [1992].) We will refer to this method as the haplotype-based cross-tabulation, in which the $H$ different haplotypes are cross-tabulated by the two levels of disease status (e.g., case and control). The hypothesis of no association between disease and haplotype is tested conventionally using the Pearson chi-square statistic for $H \times 2$ tables. Since the test is based on $2n$ observations, it would appear to be difficult to extend this analysis to other covariates measured on an individual level.

Our proposed method, which we term haplotype-based logistic regression, uses logistic regression rather than cross-tabulation, and is based on the number of copies (0, 1, or 2) of the $H$ haplotypes each individual has. It imposes an explicit assumption of additivity on the logit scale: the logit of each heterozygote is assumed to be halfway between the logits of the two corresponding homozygotes. We can express this assumption via one of two parameterizations that, while mathematically equivalent, shed different light on the nature of the assumptions.

The first, the homozygous parameterization of the model, useful in a cohort study, represents all $G$ values of $\pi(x;y)$ in terms of the $H$ probabilities $\pi(x;x)$ of

disease occurrence in homozygotes. Rather than employ these $H$ parameters directly, we transform them to a logit scale, defining $\theta(x)$ such that

$$\ln\frac{\pi(x;x)}{1-\pi(x;x)} = 2\theta(x), \quad\text{or}\quad \pi(x;x) = \frac{e^{2\theta(x)}}{1+e^{2\theta(x)}}.$$

The assumption of additivity mentioned above is expressed by

$$\ln\frac{\pi(x;y)}{1-\pi(x;y)} = \theta(x) + \theta(y), \tag{1}$$

or equivalently,

$$\pi(x;y) = \frac{\exp[\theta(x)+\theta(y)]}{1+\exp[\theta(x)+\theta(y)]}.$$

The null hypothesis, $H_{0H}$: $\theta(x)$ = constant, for all $x$, is evaluated in terms of the difference between the log likelihood for the model with $H$ parameters and no intercept, and the log likelihood of the model under $H_0$ with only a single (intercept) term. For $n$ large, under $H_{0H}$, twice this difference has a chi-square distribution with $H - 1$ degrees of freedom [Hosmer and Lemeshow, 1989]. Note that $H_{0H}$ together with (1) is equivalent to $H_{0G}$. In a cohort study, $2\theta(x)$ is the log odds of disease for individuals homozygous for haplotype $x$.

The second, the baseline-haplotype parameterization, selects for reference a baseline haplotype, $z$, as the one we want to use as a basis for comparison. (This can be arbitrary, but generally will be the most common haplotype.) The parameterization is particularly appropriate for a case-control study, or when there is a major haplotype (perhaps occurring in over 80% of all individuals). The model expresses all $G$ values of $\pi(x;y)$ in terms of the $H$ probabilities $\pi(x;z)$, or equivalently in terms of an intercept, $\alpha$, and $H - 1$ coefficients $\beta(x)$, where

$$\ln\frac{\pi(x;z)}{1-\pi(x;z)} = \begin{cases} \alpha, & x = z \\ \alpha + \beta(x), & \text{otherwise.} \end{cases} \tag{2}$$

Each parameter $\beta(x)$ gives the log odds ratio for disease for an individual with a single copy of the haplotype $x$ and another copy of the base haplotype, relative to an individual with two copies of the base haplotype. The assumption of additivity states that for $x, y \neq z$,

$$\ln\frac{\pi(x;y)}{1-\pi(x;y)} = \alpha + \beta(x) + \beta(y). \tag{3}$$

The null hypothesis $H^*_{0H}$: $\beta(x) = 0$, all $x$, would be evaluated by twice the difference between the log likelihood for the model with $H$ parameters and the log likelihood of the model under $H_0$, i.e., with only a single (intercept) term. Setting $\alpha = 2\theta(z)$ and $\beta(x) = \theta(x) - \theta(z)$ for $x \neq z$, it can be seen that $H^*_{0H} = H_{0H}$.

Alternatively, it is possible to reject $H_{0H}$ based on the maximum $Z$ statistic noted for comparing any haplotype to the baseline haplotype. The test statistic is $\max_x Z(x)$, where $Z(x) = \hat\beta(x)/\mathrm{s.e.}(\hat\beta(x))$ and $\hat\beta(x)$ is the maximum likelihood estimate of $\beta(x)$. Under the null hypothesis, $Z(x)$ has an asymptotic standardized normal distribution. To find a conservative estimate of the P value associated with the maximum over $H - 1$ values of $x$, use the Bonferroni correction: multiply the smallest P value associated with $Z(x)$ by $H - 1$.

Conceptually, it is possible to test the additivity assumptions (1) or (3) by taking the difference between the log likelihoods of the genotype model with $G - 1$ parameters, and the one according to equation (1) or (3) with $H - 1$ parameters. If the proposed additivity assumption is true, twice the difference in log likelihoods, or more simply, the difference between the chi-square statistic used to test $H_{0G}$ and the one used to test $H_{0H}$, will asymptotically have a chi-square distribution with $G - H$ degrees of freedom. This lack-of-fit test is included in many statistical software programs. However, asymptotic results based on $G - H$ degrees of freedom are questionable if some of the $G$ genotypes are noted in only three or fewer patients.

The haplotype-based logistic regression retains the identity of the particular genetic configuration and of the individual and can thus include demographic information, environmental risk factors, or environment-gene interactions, by simply adding them into the regression.
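The two parameterizations and the likelihood ratio test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the genotype counts in `TABLE`, the haplotype labels "A", "B", "C", and the helper names `copy_counts`, `expand`, and `fit_logistic` are all hypothetical and do not come from the paper, and the fit uses a plain Newton-Raphson iteration rather than any particular statistical package.

```python
import numpy as np

# Hypothetical data, NOT from the paper: (haplotype x, haplotype y, cases, controls).
TABLE = [("A", "A", 10, 20), ("A", "B", 8, 8), ("A", "C", 5, 7),
         ("B", "B", 3, 1), ("B", "C", 2, 2), ("C", "C", 1, 1)]
HAPLOTYPES = ["A", "B", "C"]  # "A" plays the role of the baseline haplotype z


def copy_counts(x, y, haplotypes):
    """Number of copies (0, 1, or 2) of each haplotype in the genotype (x;y)."""
    return [(x == h) + (y == h) for h in haplotypes]


def expand(table, haplotypes):
    """One row per individual: copy counts and a 0/1 disease indicator."""
    rows, outcome = [], []
    for x, y, cases, controls in table:
        rows += [copy_counts(x, y, haplotypes)] * (cases + controls)
        outcome += [1] * cases + [0] * controls
    return np.array(rows, dtype=float), np.array(outcome, dtype=float)


def fit_logistic(X, y, n_iter=50):
    """Maximum likelihood by Newton-Raphson; returns (coefficients, log likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return beta, float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


counts, disease = expand(TABLE, HAPLOTYPES)
ones = np.ones((len(disease), 1))

# Homozygous parameterization: no intercept, all H copy-count columns (theta).
theta, ll_homozygous = fit_logistic(counts, disease)
# Baseline-haplotype parameterization: intercept plus H - 1 columns (alpha, beta).
coef, ll_baseline = fit_logistic(np.hstack([ones, counts[:, 1:]]), disease)
# Null model: intercept only.
_, ll_null = fit_logistic(ones, disease)

# Likelihood ratio statistic, to be compared with chi-square on H - 1 df.
lr_statistic = 2.0 * (ll_baseline - ll_null)
print("theta:", theta)
print("alpha, beta:", coef)
print("LR statistic:", lr_statistic)
```

Because every individual carries exactly two haplotypes, the copy counts in each row sum to 2, so the two design matrices span the same space. The fitted values therefore satisfy $\alpha = 2\theta(z)$ and $\beta(x) = \theta(x) - \theta(z)$ up to numerical tolerance, and both fits give the same log likelihood.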
To test whether there is a common non-zero effect of a collection of $k$ such variables ($k \geq 1$), after adjusting for genetic effects, compare twice the difference in log likelihood between a model containing such effects and a model that does not contain them to a $\chi^2$ distribution with $k$ degrees of freedom. The presence of gene-environment interaction is most easily evaluated in a model with an intercept (i.e., the second parameterization), by first constructing $H - 1$ new variables by multiplying each of the $H - 1$ haplotype counts by the variable indicating environmental exposure, and then examining the difference in log likelihoods as described above.

EXAMPLE

We illustrate the application of these methods for the 182 Caucasian women in the case-control study of breast cancer described by Weston et al. [1997]. There were $m = 3$ biallelic loci, so the maximum number of possible haplotypes is $2^3 = 8$, although only six were observed in the full sample of 284 women in all three racial groups studied. The most frequently occurring haplotype is designated 1-1-1; the second most prevalent was 2-1-1; but as evidence of linkage disequilibrium, the third was 2-2-2. In Caucasians, only two additional haplotypes were observed: in four individuals, and in a single (case) patient. Thus we chose to collapse these two categories into an "other" category for the statistical analysis, so that $H = 4$, and $G = H(H + 1)/2 = 10$. We will now illustrate analysis of these data by the different methods discussed here.

Genotype-Based Cross-Tabulation

For the 182 Caucasian women, Table I cross-tabulates the $G = 10$ genotypes by disease status (case or control). The ordinary Pearson chi-square statistic with nine degrees of freedom is $\chi^2$ = (P = 0.194). Alternatively, the chi-square statistic based on the difference in log likelihood is 13.97, also with nine degrees of freedom.

TABLE I. Cross-Tabulation of Genotype by Disease Group*

Genotype pattern   Number of outcomes   Number of haplotypes   Cases   Controls   Other   Total

*In pattern 7, other = 2-2-1; for all other patterns, other refers to As previously noted, the nomenclature is not consistent with Själander et al., or other previous papers. The first and third indices are switched, so what we call Själander et al. [1995a,b; 1996] call 2-1-2, our is their 1-1-2, and our is their

Haplotype-Based Cross-Tabulation

This analysis is based on the 364 haplotypes, rather than 182 individuals. Table II cross-tabulates the four haplotypes under consideration by disease group. For this table, Pearson's $\chi^2$ = 7.03, with three degrees of freedom, so that P = Thus, the P value obtained by this haplotype-based cross-tabulation (and as we note below, for the haplotype-based logistic regression) was smaller than that obtained by the genotype-based analysis.

TABLE II. Cross-Tabulation of Haplotype by Disease Group

                                                   Other     Total
Cases      89 (68%)     14 (11%)     24 (18%)     3 (2%)    130
Controls   183 (78%)    26 (11%)     22 (9%)      3 (1%)

Haplotype-Based Logistic Regression

Logistic regression is usually performed entering data on each individual separately, with the outcome case or control. In a model evaluating genetic effects only, the covariates are the number of haplotypes (0, 1, or 2) for each haplotype pattern. In our example with $H = 4$, either three haplotypes are used with an intercept, or the intercept is suppressed and all four are entered. Applied to our example, the procedure yields a chi-square statistic of 6.91 with three degrees of freedom, so that P = We used JMP (SAS Institute) for the calculations, but any conventional logistic regression program will suffice.

TABLE III. Parameter Estimates Obtained by Logistic Regression

Parameter   Estimate   se(est.)
Homozygous parameterization, no intercept
  θ(1,1,1)
  θ(2,1,1)
  θ(2,2,2)
  θ(other)
Baseline-haplotype parameterization with the (1-1-1) haplotype as the base haplotype
  Intercept
  β(2,1,1)
  β(2,2,2)
  β(other)

If the data had come from a cohort study, one could have used the homozygous parameterization to estimate the probability of disease for any specific genotype. These estimates can be derived from the output as given in the top panel of Table III. For example, for an individual with two copies of the haplotype, the probability of disease would be estimated as exp( )/[1 + exp( )] = Similarly, for an individual with one copy of and one copy of the probability of disease is estimated as exp( )/[1 + exp( )] = For a case-control study, these probabilities are not estimable [Prentice and Pyke, 1979], so the baseline-haplotype parameterization would be appropriate. The bottom panel of Table III gives the estimates for that parameterization. For example, the estimate for $\beta(2,2,2)$ indicates that a genotype with $x = (2,2,2)$, $y = (1,1,1)$ is estimated to multiply the risk of cancer by exp(0.856) = 2.35, as compared to two copies of the baseline haplotype, 1-1-1. The 95% confidence interval for this relative risk (ignoring issues of simultaneous testing) is exp[0.856 ± 1.96(.339)], or 1.21 to

The test for the assumption of additivity has $G - H = 10 - 4 = 6$ degrees of freedom, and the $\chi^2$ statistic is given by 13.97 − 6.91 = 7.06, thus giving us no reason to question the assumption of additivity. However, since three of the ten genotypes were noted in only one individual, and two other genotypes were noted in only two individuals, it is important to stress that this test merely fails to contradict the assumption of additivity; it does not confirm the assumption.
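The arithmetic behind two of the reported figures can be checked with a short Python sketch. It uses only the cell counts printed in Table II and the estimate 0.856 with standard error .339 quoted in the text. The function names `pearson_chi_square` and `odds_ratio_with_ci` are illustrative, and the row totals are obtained here by summing the cells.

```python
import math

# Table II counts as printed: three retained haplotypes, then "other".
CASES = [89, 14, 24, 3]
CONTROLS = [183, 26, 22, 3]


def pearson_chi_square(rows):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for r, row_total in zip(rows, row_totals):
        for observed, col_total in zip(r, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    return stat


def odds_ratio_with_ci(beta, se, z=1.96):
    """Odds ratio exp(beta) with the Wald interval exp(beta +/- z * se)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Haplotype-based cross-tabulation: (4 - 1) * (2 - 1) = 3 degrees of freedom.
chi_square = pearson_chi_square([CASES, CONTROLS])
# Estimate and standard error for beta(2,2,2) as quoted in the text.
odds_ratio, lower, upper = odds_ratio_with_ci(0.856, 0.339)

print(f"chi-square = {chi_square:.2f}")
print(f"odds ratio = {odds_ratio:.2f}, lower 95% limit = {lower:.2f}")
```

Run as written, the statistic agrees with the 7.03 reported for Table II, and the odds ratio and lower confidence limit agree with the 2.35 and 1.21 given in the text.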
DISCUSSION

The logistic regression method presented here is straightforward, is commonly used in epidemiologic calculations, lends itself to interpretation of the magnitude of effects, is easily extended to control for other variables, and gives an overall test with fewer degrees of freedom than the model based on all genotypes. Extension to several groups, e.g., comparing the three racial groups in our example, or subdividing controls into benign breast disease and other controls, is also straightforward. It is also straightforward to incorporate multiple alleles at one or more loci.

The impetus for proposing this logistic regression arose from the breast cancer work referred to above, and from the fact that with the allele-specific PCR protocol, it is now possible to directly determine haplotypes in some cases. However, the method could equally well be applied to any polymorphic system, whether multiple alleles at a single locus and/or multiple haplotypes over several loci.

Smouse and Williams [1982] propose a general framework for constructing multivariate tests in this situation. (We discuss only their one-locus case, since that corresponds to our situation, in which haplotypes can be determined directly.) In particular, they suggest allelic counting as a scoring convention and point out that this gives double weight to homozygotes. (Compare the method of allele counting; see above.) These allele counts then yield $H - 1$ genetic score variables, which are essentially identical to the input to our logistic regression using the second parameterization. They then construct a multivariate test statistic (their equation (5)), which they show is identical to a haplotype-based contingency-table $\chi^2$. Smouse and Williams [1982] also mention the possibility of using log-linear models, similar to the logistic regression approach proposed here, but they do not elaborate on this procedure in the context of comparing different groups (e.g., cases vs. controls), as we do here. Not only do we address a different application, we also focus on the nature of the association between genotype and disease that would make the haplotype-based tests optimal; we exploit the logistic model to estimate parameters of interest; and we discuss more general models, which include covariates.

It is also interesting to note that the haplotype-based cross-tabulation, although it may appear assumption-free, actually implicitly incorporates an additivity assumption similar (although not identical) to that incorporated in the logistic regression. This arises because the haplotypes are counted up as they appear in cases and controls, so that homozygotes are given twice the weight given to heterozygotes, as also pointed out by Smouse and Williams [1982].

In another procedure used to test for HLA-disease associations, Meddeb-Garnaoui et al. [1995] performed $H$ separate evaluations comparing the proportion of individuals with each haplotype in cases and controls.
The procedure does not distinguish between homozygotes and heterozygotes with respect to the haplotype in question. Even without an adjustment for multiplicity of tests, the procedure can give very large confidence intervals for the odds ratios and can obscure any relationship between a particular haplotype and disease, if such a relationship exists.

If one knew the exact mechanism of disease causation (e.g., an extension of classical dominant or recessive mode of inheritance), other methods of statistical analysis would be more appropriate. However, in the absence of this knowledge, our procedure gives a test that is consistent (power for a fixed alternative approaches 1.0 as sample size increases) for a large range of patterns, including situations in which either a single copy or two copies of a certain haplotype are associated with increased risk of disease.

In the example given, the P values for the haplotype-based tests were similar and were smaller than those for the genotype-based tests. We conjecture that this might be true in general. We have shown that our test is optimal under certain conditions and also indicated similarity with other haplotype-based tests discussed by Smouse and Williams [1982]. Thus, when these conditions are satisfied, or nearly satisfied, we would expect the haplotype-based procedures to give smaller P values than the genotype-based procedures.

Lastly, we note that in the limiting case of a biallelic marker at a single locus ($m = 1$) without covariates, the procedures based on cross-tabulation yield tests that are commonly used, whereas our logistic regression procedure results in a test statistic that can be viewed as a compromise between a model designed to detect a dominant relationship and one for a recessive relationship. The genotype-based cross-tabulation yields a $3 \times 2$ table cross-tabulating, for $n$ subjects, the three possible genotypes against disease status. The haplotype-based cross-tabulation reduces to a $2 \times 2$ cross-tabulation for the $2n$ alleles, of allele present or absent by disease group. Our proposed method uses logistic regression to find the relationship between the outcome status (case or control) measured in $n$ individuals, and the covariate: number of copies of the allele (0, 1, or 2). In contrast, a test procedure optimal to detect a dominant disease would use as covariate 0 if there were no copies of the allele, and 1 if there were one or more copies; while one optimal to detect a recessive disease would use as covariate 1 if there were two copies of the allele, and 0 otherwise. Our coding scheme is somewhere in between: it is optimal for some intermediate level of penetrance, but is consistent for both a dominant and a recessive model.

ACKNOWLEDGMENTS

This work was supported by NIH grants RR 00071, MH-48858, MH-52841, MH-28274, MH-36197, DK-31813, and CA/ES

REFERENCES

Cavalli-Sforza LL, Bodmer WF (1971): The Genetics of Human Populations. San Francisco: WH Freeman, p 43.
Edwards AWF (1992): Likelihood, expanded edition. Baltimore: Johns Hopkins, p 19.
Hosmer DW, Lemeshow S (1989): Applied Logistic Regression. New York: John Wiley and Sons, p 32.
Meddeb-Garnaoui A, Zeliszewski D, Mougenot JF, Djilali-Saiah I, Caillat-Zucman S, Dormoy A, Gaudebout C, Tongio MM, Baudon JJ, Sterkers G (1995): Reevaluation of the relative risk to susceptibility to celiac disease of HLA-DRB1, -DQA1, -DQB1, -DPB1 and -TAP2 alleles in a French population. Hum Immunol 43:
Prentice RL, Pyke R (1979): Logistic regression incidence models and case control studies. Biometrika 66:
Själander A, Birgander R, Athlin L, Stenling R, Rutegard J, Beckman L, Beckman G (1995a): P53 germ line haplotypes associated with increased risk for colorectal cancer. Carcinogenesis 16:
Själander A, Birgander R, Kivelä A, Beckman G (1995b): P53 polymorphisms and haplotypes in different ethnic groups. Hum Hered 45:
Själander A, Birgander R, Hallmans G, Cajander S, Lenner P, Athlin L, Beckman L, Beckman G (1996): P53 polymorphisms and haplotypes in breast cancer. Carcinogenesis 17:
Smouse PE, Williams RC (1982): Multivariate analysis of HLA-disease associations. Biometrics 38:
Sugimura H, Suzuki I, Hamada GS, Iwase T, Takahashi T, Nagura K, Iwata H, Watanabe S, Kino I, Tsugane S (1994): Cytochrome p-450 1A1 genotype in lung cancer patients and controls in Rio de Janeiro, Brazil. Cancer Epidemiol Biomarkers Prev 3:
Weston A, Pan C, Ksieski B, Wallenstein S, Berkowitz G, Tartter P, Bleiweiss I, Brower S, Senie R, Wolff M (1997): p53 haplotype determination in breast cancer. Cancer Epidemiol Biomarkers Prev 6:
