Combining dependent tests for linkage or association across multiple phenotypic traits


Biostatistics (2003), 4, 2, pp
Printed in Great Britain

XIN XU
Program for Population Genetics, Harvard School of Public Health, 667 Huntington Ave, Boston, MA 02115, USA

LU TIAN, L. J. WEI
Department of Biostatistics, Harvard School of Public Health, 667 Huntington Ave, Boston, MA 02115, USA

SUMMARY

A robust statistical method to detect linkage or association between a genetic marker and a set of distinct phenotypic traits is to combine univariate trait-specific test statistics into a more powerful overall test. This procedure does not need complex modeling assumptions, easily handles partially missing trait values, and is applicable to a mixture of qualitative and quantitative traits. In this note, we propose a simple test procedure along this line, and show its advantages over the standard combination tests for linkage or association in the literature through a data set from Genetic Analysis Workshop 12 (GAW12) and an extensive simulation study.

Keywords: Combining tests; Linkage analysis; Random effects models; Tests for association.

1. INTRODUCTION

For studies of linkage and allelic association between phenotypic traits and genotypic markers, several distinct traits, which may be closely related to a complex disease of interest, are often available from each study subject. The key question is how to use such multivariate data efficiently in the analysis. Three different statistical approaches have been taken to handle this problem in the literature. The first is to use so-called random effects models to deal with the correlation among the phenotypic variables (see, for example, Laird and Ware, 1982; Korol et al., 1995; Jiang and Zeng, 1995; Williams et al., 1999; Hackett et al., 2001; Iturria and Blangero, 2000; Calinski et al., 2000; McCulloch and Searle, 2001).
This novel approach, however, depends heavily on the model assumptions and has no efficient way to handle trait values that are partially missing. The second approach is to create a single phenotypic variable as a linear combination of all the trait values from each subject, and then perform standard univariate analyses (Amos et al., 1990; Amos and Laing, 1993). To locate the optimal linear combination, however, one needs rather computing-intensive grid search methods. It is difficult, if not impossible, to obtain the bona fide p-value for the resulting optimal test for linkage or association. Moreover, like the previous one, this approach may not be able to handle missing observations efficiently either. The third approach is to combine individual test statistics or estimators obtained from the trait-specific univariate analyses for a global assessment of linkage or association (O'Brien, 1984; Wei and Johnson, 1985; Liang and Zeger, 1986; Xu et al., 2001). The resulting procedures are nonparametric and do not depend on complex modeling. Furthermore, they can easily handle the case in which some trait values are missing completely at random and the case with a mixture of quantitative and qualitative traits.

In this note, we consider the third approach for combining information across the multivariate traits, and present a simple modification of the existing test procedures for linkage or association. We use a data set from GAW12 to illustrate the new proposal. We also conducted an extensive numerical study to demonstrate that the modification greatly improves the power for detecting linkage or association between the phenotypic and genotypic markers over the conventional linear combination procedures.

© Oxford University Press (2003)

2. COMBINING TEST STATISTICS

Let $T = (T_1, \ldots, T_K)'$ be a vector of $K$ possibly correlated statistics, each obtained from a trait-specific univariate analysis. For example, $T_k$ is a test statistic for linkage or association based solely on the data from the $k$th trait, $k = 1, \ldots, K$. For all the cases we encountered in practice, $T$ is asymptotically normal with mean $\beta = (\beta_1, \ldots, \beta_K)'$ and known (or consistently estimated) covariance matrix $\Sigma$. Suppose that we are interested in testing the hypothesis $H_0: \beta = 0$ against a general alternative hypothesis $H_1: \beta_k \ge 0$, $k = 1, \ldots, K$, with at least one $\beta_k > 0$. For this one-sided alternative, one may consider a class of linear combinations $a'T$ as test statistics, where $a = (a_1, \ldots, a_K)'$ is a vector of possibly data-dependent weights.
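As a minimal sketch of such a linear combination test, the Python function below computes $a'T$ for a fixed weight vector and refers it to its normal null distribution, using the fact that $a'T$ has variance $a'\Sigma a$ when $T \sim N(0, \Sigma)$. The function name and the example numbers are illustrative assumptions, and the normal reference is only appropriate when the weights are fixed in advance rather than data-dependent.

```python
import numpy as np
from math import erfc, sqrt


def linear_combination_test(t, sigma, a):
    """One-sided test based on a'T for a FIXED weight vector a (illustrative helper).

    Under H0, T ~ N(0, sigma), so a'T / sqrt(a' sigma a) is standard normal.
    Returns the standardized statistic and its upper-tail p-value.
    """
    t = np.asarray(t, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    a = np.asarray(a, dtype=float)
    stat = float(a @ t)                  # the linear combination a'T
    se = sqrt(float(a @ sigma @ a))      # null standard deviation of a'T
    z = stat / se
    p_value = 0.5 * erfc(z / sqrt(2.0))  # P(N(0,1) > z)
    return z, p_value


# Hypothetical example: two uncorrelated statistics, equal weights.
z, p = linear_combination_test([1.0, 1.0], np.eye(2), [1.0, 1.0])
print(round(z, 4), round(p, 4))
```

Large positive values of the standardized statistic count as evidence against $H_0$, which matches the one-sided alternative $H_1$.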
If $\beta_1 = \cdots = \beta_K$, the linear combination with $a = \Sigma^{-1}e$ is the most powerful test for testing $H_0$ against $H_1$, where $e = (1, \ldots, 1)'$ (O'Brien, 1984; Wei and Johnson, 1985). The corresponding test statistic is

$$e'\Sigma^{-1}T. \tag{1}$$

A large value of (1) indicates that $H_0$ is not correctly specified. When $T_1, \ldots, T_K$ are mutually independent, (1) is simply $\sum_{k=1}^{K} \sigma_k^{-2} T_k$, a well-known procedure for combining information across $K$ independent studies in meta-analysis, where $\sigma_k$ is the standard error of $T_k$. The test based on (1), however, may have low power against alternatives in which the components of $\beta$ are not clustered together.

Now, consider the above testing problem for a simple alternative hypothesis that $\beta = \beta_0 = (\beta_{10}, \ldots, \beta_{K0})'$, a given vector in $H_1$. It is straightforward to show that the weight $a$ of the best linear combination of the $T_k$ for testing $H_0$ against this simple alternative hypothesis is $\Sigma^{-1}\beta_0$. The resulting test statistic is

$$\beta_0'\Sigma^{-1}T. \tag{2}$$

Since we are interested in a test which is powerful against any $\beta$ in $H_1$ (not just against a specific $\beta_0$) and $T$ is expected to be a good estimate for $\beta$, it seems natural to replace $\beta_0$ in (2) with $T$. This gives us a test statistic

$$T'\Sigma^{-1}T, \tag{3}$$

which is chi-square distributed with $K$ degrees of freedom under $H_0$. Although this test is omnibus with respect to $H_1$, it is not very powerful against specific alternatives due to the heavy tail of the chi-square distribution. The problem with replacing $\beta_0$ in (2) with $T$ is that, under the null hypothesis $H_0$, $T$ converges to $\beta = 0$; therefore, the test statistic (3) is no longer a linear combination of $T$.

The optimal test statistic (2) for testing against a general $\beta$ can be rewritten as

$$\left(\frac{\beta_1}{\sigma_1}, \ldots, \frac{\beta_K}{\sigma_K}\right)\Gamma^{-1}Z, \tag{4}$$

where $Z = (Z_1, \ldots, Z_K)'$, $Z_k = T_k/\sigma_k$, and $\Gamma$ is the corresponding correlation matrix of $T$. Ideally, we would like to have a test statistic which, under $H_0$, behaves like a linear combination of $Z$, but whose $k$th weight under the alternative $H_1$ is $\beta_k/\sigma_k$. This motivates us to consider the following simple test statistics:

$$W(c) = (Z_{1c}, \ldots, Z_{Kc})\,\Gamma^{-1}Z, \tag{5}$$

where $Z_{kc} = \max\{Z_k, c\}$, $k = 1, \ldots, K$, and $c$ is some given non-negative constant. For example, when $c = 2$, under $H_0$ each $Z_k$ is standard normal, so $Z_{kc} \approx 2$ and $W(c)$ is approximately a linear combination of $Z$. On the other hand, for a relatively large $\beta_k > 0$ under the alternative hypothesis, $Z_{kc}$ is most likely to be $Z_k \approx \beta_k/\sigma_k$. Note that under $H_0$, the distribution of $W(0)$ has a rather long tail, and the corresponding test performs like the omnibus test (3). On the other hand, $W(4)$ is approximately proportional to $e'\Sigma^{-1}T$, and its operating characteristics are similar to those of test (1). Therefore, there is no single $c$ which would make the test $W(c)$ optimal for a broad class of alternatives.

Here, we present a simple procedure for testing $H_0$ against $H_1$, which chooses $c$ in (5) automatically and objectively from a reasonably large interval $[0, \tau]$. Note that $W(c)$ is not normally distributed under $H_0$, but its null distribution can be estimated easily by generating $M$ (a large number) independent $Z = (Z_1, \ldots, Z_K)'$ from the normal distribution with mean 0 and variance-covariance matrix $\Gamma$. For each realized $Z$, we compute $W(c)$ as a function of $c$. This establishes a reference set $D$ consisting of $M$ realized $\{W(c);\ 0 \le c \le \tau\}$ under the null hypothesis. Now, let $w(c)$ be the value of $W(c)$ for the observed data; then its p-value $p(c) = \mathrm{pr}(W(c) > w(c))$ can be estimated based on the reference set $D$. Let $p_m = \min_{0 \le c \le \tau} p(c)$. A small $p_m$ suggests a rejection of $H_0$.
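The simulation-based procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the grid step for $c$, the seed, and the small default values of $M$ and $N$ are choices made here for illustration. The final step, which calibrates $p_m$ against its own null distribution, follows the description given in the next paragraph; using `<=` rather than `<` in that step is a further assumption, made so that ties in the Monte Carlo estimates err on the conservative side.

```python
import numpy as np


def w_stat(z, gamma_inv, c_grid):
    """W(c) = (max(Z_1, c), ..., max(Z_K, c)) Gamma^{-1} Z for every c in c_grid.

    z has shape (n, K) or (K,); the result has shape (n, len(c_grid)).
    """
    z = np.atleast_2d(np.asarray(z, dtype=float))
    v = z @ gamma_inv  # each row is (Gamma^{-1} Z)', since Gamma^{-1} is symmetric
    return np.stack([(np.maximum(z, c) * v).sum(axis=1) for c in c_grid], axis=1)


def tail_probs(w, ref_sorted):
    """Estimate p(c) = pr(W(c) > w(c)) column by column from the sorted reference set."""
    m = ref_sorted.shape[0]
    p = np.empty_like(w)
    for j in range(w.shape[1]):
        n_le = np.searchsorted(ref_sorted[:, j], w[:, j], side="right")
        p[:, j] = (m - n_le) / m  # fraction of reference values strictly above w
    return p


def adaptive_test(z_obs, gamma, tau=4.0, step=0.5, m=20000, n=20000, seed=0):
    """Return (p_m, calibrated p-value). step, m, n and seed are illustrative defaults."""
    rng = np.random.default_rng(seed)
    gamma = np.asarray(gamma, dtype=float)
    k = gamma.shape[0]
    gamma_inv = np.linalg.inv(gamma)
    c_grid = np.arange(0.0, tau + step / 2, step)

    # Reference set D: M null realizations of {W(c)} on the grid of c values.
    z_ref = rng.multivariate_normal(np.zeros(k), gamma, size=m)
    ref_sorted = np.sort(w_stat(z_ref, gamma_inv, c_grid), axis=0)

    # Observed p(c) and its minimum over the grid.
    p_c = tail_probs(w_stat(z_obs, gamma_inv, c_grid), ref_sorted)[0]
    p_m = float(p_c.min())

    # Null distribution of P_m from N fresh draws, each judged against D.
    z_new = rng.multivariate_normal(np.zeros(k), gamma, size=n)
    p_m_null = tail_probs(w_stat(z_new, gamma_inv, c_grid), ref_sorted).min(axis=1)
    return p_m, float(np.mean(p_m_null <= p_m))


# Hypothetical standardized statistics for K = 3 traits with equal correlation 0.3.
gamma = np.full((3, 3), 0.3) + 0.7 * np.eye(3)
print(adaptive_test([2.5, 0.1, 1.8], gamma))
```

Sorting the reference set once lets every later tail probability be read off with a binary search, which keeps the calibration step cheap even when $M$ and $N$ are large.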
Note that we essentially choose the test which gives the smallest p-value among all the tests $\{W(c),\ 0 \le c \le \tau\}$. Naturally, $p_m$ is not the correct p-value for such a test. To establish the cut-off points of our test procedure, let $P(c)$ and $P_m$ be the random counterparts of $p(c)$ and $p_m$, respectively. The null distribution of $P_m$ can be estimated by generating $N$ (a large number) fresh independent $Z$ from $N(0, \Gamma)$. For each realized $Z$, we compute $\{W(c),\ 0 \le c \le \tau\}$, and use the reference set $D$ to obtain the corresponding $P(c)$ and $P_m$. The null distribution of $P_m$ can be estimated using those $N$ realizations, and the bona fide p-value $\mathrm{pr}(P_m < p_m)$ of the test can then be estimated accordingly.

Note that there is a parameter $\tau$ in the above proposal. Through an extensive numerical study, we find that the operating characteristics of the test are quite stable with $\tau \ge 4$. In practice, we recommend the test with $\tau =$

3. EXAMPLE

Let us use an example of a linkage study to illustrate the above test procedure. Suppose that we are interested in detecting a linkage between a genetic marker and a complex disease trait with $K$ intermediate phenotypic variables from each study subject. Haseman and Elston (1972) proposed a simple linear regression model for testing linkage with sib-pair observations and a single phenotypic variable per subject. Specifically, for the $k$th trait and a typical sib-pair, let $X_k$ and $\pi$ be the squared trait difference and the mean genetic sharing identical by descent at the marker, respectively, $k = 1, \ldots, K$. Then

$$E(X_k) = \alpha_{1k} - \beta_k \pi, \tag{6}$$

where $E(X)$ is the expected value of $X$. A large value of the estimate for $\beta_k$ suggests a linkage between the $k$th trait and the marker. Drigalenko (1998) and Elston et al. (2000) suggested replacing the response variable in the Haseman and Elston model with the squared mean-corrected trait sum $Y_k$ for the sib-pair. That is,

$$E(Y_k) = \alpha_{2k} + \beta_k \pi. \tag{7}$$
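The two regressions can be sketched in Python as follows, assuming the sign convention in which the squared trait difference decreases with $\pi$ and the squared mean-corrected sum increases with $\pi$, so that both models yield an estimate of the same $\beta_k$. The function names, the use of the pooled sample mean for the mean correction, and the use of ordinary least squares are assumptions made for illustration; the step that combines the two estimates into a single $T_k$ is not shown.

```python
import numpy as np


def he_responses(y1, y2):
    """Sib-pair responses for one trait (illustrative helper).

    x: squared trait difference; y: squared mean-corrected trait sum,
    with the trait mean estimated here by the pooled sample mean (an assumption).
    """
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    mu = np.concatenate([y1, y2]).mean()
    x = (y1 - y2) ** 2
    y = (y1 + y2 - 2.0 * mu) ** 2
    return x, y


def he_slopes(x, y, pi_hat):
    """Least-squares estimates of beta_k from the difference and sum regressions.

    The difference regression has slope -beta_k, so its sign is flipped.
    """
    slope_x = np.polyfit(pi_hat, x, 1)[0]  # slope of x on pi
    slope_y = np.polyfit(pi_hat, y, 1)[0]  # slope of y on pi
    return -slope_x, slope_y


# Hypothetical data for six sib-pairs: trait values and IBD sharing proportions.
pi_hat = np.array([0.0, 0.5, 1.0, 0.5, 0.0, 1.0])
y1 = np.array([1.2, 0.3, -0.5, 0.8, -1.1, 0.4])
y2 = np.array([-0.7, 0.1, -0.4, -0.2, 1.0, 0.6])
x, y = he_responses(y1, y2)
print(he_slopes(x, y, pi_hat))
```

With real data the two estimates generally differ, which is why a rule for combining them is needed before a single statistic per trait is available.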

Note that these two models share the same regression coefficient $\beta_k$. Recently, Xu et al. (2000) used the idea of Wei and Johnson (1985) and proposed a simple, unified estimation procedure for $\beta_k$ by linearly combining the regression coefficient estimates for the above two models. Let the resulting estimator be denoted by $T_k$, $k = 1, \ldots, K$. The covariance matrix estimate $\Sigma$ for $T$ can be obtained easily through some elementary probability arguments (Xu et al., 2001).

Now, we apply our new test procedure for linkage to Problem 2 of GAW12 (Wijsman et al., 2001). The data for this problem were simulated for a common oligogenic disease with five intermediate quantitative traits Q1-5, which were generated through a random effects model with various combinations of five specific genetic loci MG1-5 and two environmental factors, gender and age. The relative contributions of MG1-5 to the total variance of Q1-5 are given in Table 1. Note that there is no single marker which regulates all five traits. For this simulated GAW study, there are 50 replicates of 23 large pedigrees from a general population. For illustration of our proposal, we randomly selected 500 independent sibling pairs from the above population. Prior to our linkage analysis using models (6) and (7), the trait values Q1-5 were adjusted for the two environmental factors, age and gender. Moreover, each adjusted trait value was standardized by subtracting its empirical mean and dividing by its sample standard deviation.

Table 1. Relative genetic variance components of traits Q1-5 for GAW12 simulated data set. Columns: Q1, Q2, Q3, Q4, Q5; rows: MG1 to MG5.

Table 2. p-values for univariate and combination tests at MG1-5 with GAW12 data. Columns: Q1, Q2, Q3, Q4, Q5, SLC, NEW; rows: MG1 to MG5. SLC: standard linear combination test.

In Table 2, we report p-values for all the univariate tests based on $T_k$ for linkage (see Xu et al. (2000)), the standard linear combination (SLC) test based on (1) (O'Brien, 1984; Wei and Johnson, 1985; Xu et al., 2001) and our new test at these five genetic loci. Compared with the SLC test, the new test gives substantially smaller p-values for all the markers MG1-5. Moreover, for MG3, which regulates three phenotypic traits, the p-value of the new test is uniformly smaller than those based on the univariate tests. All the p-values in this example are estimated based on $M = N = 10^6$ realizations with $\tau = 4$. Each $p_m$ was obtained by minimizing $p(c)$ over the interval $[0, 4]$ by an increment of

4. OPERATING CHARACTERISTICS FOR THE NEW TEST

We conducted an extensive numerical study to examine whether our proposal is more powerful than the most commonly used SLC test (1). To include a broad class of alternatives in our comparisons, we let $\beta$ be a vector of independent copies of the uniform random variable defined on the interval $(0, 3)$. For each given $\beta$, we generate an observed vector $T$ of $K$ test statistics with a given covariance matrix $\Sigma$. The reference set $D$ and the estimated null distribution of $P_m$ are based on $M = N = 10^6$ independent $Z$. The p-value of the SLC test is computed from the normal approximation. For all the cases we studied, the new test is almost uniformly better than its standard counterpart. In Figure 1, we present results for two scenarios with $K = 5$ and $\Sigma$ being a correlation matrix with equal correlation $\eta$. The horizontal axis of each plot is the log-transformed p-value for the SLC test, and the vertical axis is the counterpart for the new test. Each dot in the figure represents a pair of log-transformed p-values for a specific alternative $\beta$ generated as described above. Most dots are below the 45° line, indicating that the new test is more powerful than the standard one. For many cases, the improvements are quite substantial. For cases in which the new test is not better than the standard one with respect to their p-values, the corresponding dots in the figure are quite close to the 45° line, indicating that there is no practical difference between these two tests.

Fig. 1. Scatter diagrams of 500 pairs of log-transformed p-values for the standard linear combination (SLC) and the new tests. (a) Correlation $\eta = 0.3$; (b) $\eta = 0.7$.

5. REMARKS

The proposed test procedure is an effective screening device for linkage and/or association studies in the presence of multiple traits that are known to relate to a complex trait of interest. For a given genetic marker, it is very likely that not every trait is related to the marker. Our new test procedure objectively and automatically puts more weight on those traits that have large observed correlations with the marker of interest.
On the other hand, the standard linear combination test advocated by O'Brien (1984) and Wei and Johnson (1985) does not use such valuable information for an overall evaluation of linkage or association.

Our test can be generalized to the case with two-sided alternative hypotheses. For example, one possible modification is to replace $Z_k$ and $\Gamma$ in (5) by $|Z_k|$ and the identity matrix, respectively; this results in the test statistic $\sum_k \max\{|Z_k|, c\}\,|Z_k|$. When $c = 0$ in (5), the resulting test is very similar to a chi-square test proposed by Lange et al. (2002) for the transmission/disequilibrium test with multiple phenotypic variables. Specifically, they considered the standard quadratic form $T'\Sigma^{-1}T$ as the test statistic. This Wald-type test is omnibus and is designed for testing against a general two-sided alternative.

ACKNOWLEDGEMENTS

We are grateful to an Associate Editor for insightful comments on the paper. GAW12 is supported by grant GM

REFERENCES

AMOS, C. I., ELSTON, R. C., BONNEY, G. E., KEATS, B. J. AND BERENSON, G. (1990). A multivariate method for detecting genetic linkage, with application to a pedigree with an adverse lipoprotein phenotype. American Journal of Human Genetics 47.

AMOS, C. I. AND LAING, A. E. (1993). A comparison of univariate and multivariate tests for genetic linkage. Genetic Epidemiology 10.

CALINSKI, T., KACZMAREK, Z., KRAJEWSKI, P., FROVA, C. AND SARI-GORLA, M. (2000). A multivariate approach to the problem of QTL localization. Heredity 84.

DRIGALENKO, E. (1998). How sib pairs reveal linkage. American Journal of Human Genetics 63.

ELSTON, R. C., BUXBAUM, S., JACOBS, K. B. AND OLSON, J. M. (2000). Haseman and Elston revisited. Genetic Epidemiology 19.

HACKETT, C. A., MEYER, R. C. AND THOMAS, W. T. (2001). Multi-trait QTL mapping in barley using multivariate regression. Genetics Research 77.

HASEMAN, J. K. AND ELSTON, R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behavior Genetics 2.

ITURRIA, S. J. AND BLANGERO, J. (2000). An EM algorithm for obtaining maximum likelihood estimates in the multi-phenotype variance components linkage model. Annals of Human Genetics 64.

JIANG, C. AND ZENG, Z. B. (1995). Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics 140.

KOROL, A. B., RONIN, Y. I. AND KIRZHNER, V. M. (1995). Interval mapping of quantitative trait loci employing correlated trait complexes. Genetics 140.

LAIRD, N. M. AND WARE, J. H. (1982). Random-effects models for longitudinal data. Biometrics 38.

LANGE, C., SILVERMAN, E., WEISS, S., XU, X. AND LAIRD, N. M. (2002). A multivariate transmission disequilibrium test. Biostatistics, in press.

LIANG, K. Y. AND ZEGER, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73.

MCCULLOCH, C. E. AND SEARLE, S. R. (2001). Generalized, Linear, and Mixed Models. New York: Wiley.

O'BRIEN, P. C. (1984). Procedures for comparing samples with multiple endpoints. Biometrics 40.

WEI, L. J. AND JOHNSON, W. E. (1985). Combining dependent tests with incomplete repeated measurements. Biometrika 72.

WIJSMAN, E. M. et al. (2001). Analysis of complex genetic traits: Applications to asthma and simulated data. Genetic Epidemiology Supp. 1, S1-S853.

WILLIAMS, J. T., VAN EERDEWEGH, P., ALMASY, L. AND BLANGERO, J. (1999). Joint multipoint linkage analysis of multivariate qualitative and quantitative traits. I. Likelihood formulation and simulation results. American Journal of Human Genetics 65.

XU, X., PALMER, L. J., HORVATH, S. AND WEI, L. J. (2001). Combining multiple phenotypic traits optimally for detecting linkage with sib-pair observations. Genetic Epidemiology Supp. 1, S148-S153.

XU, X., WEISS, S., XU, X. AND WEI, L. J. (2000). A unified Haseman-Elston method for testing linkage with quantitative traits. American Journal of Human Genetics 67.

[Received March 5, 2002; first revision March 28, 2002; second revision June 18, 2002; accepted for publication June 23, 2002]
