Non-iterative, regression-based estimation of haplotype associations Benjamin French, PhD Department of Biostatistics and Epidemiology University of Pennsylvania bcfrench@upenn.edu National Cancer Center Republic of Korea 12 September 2012
nationalatlas.gov TM Where We Are STATES WASHINGTON CA NADA ALASKA ATLANTIC OCEAN HAWAII MONTANA NORTH DAKOTA L a k e S u p e r i o r PACIFIC OCEAN MAINE P A C I F I C O C E A N OREGON CALIFORNIA NEVADA IDAHO ARIZONA UTAH WYOMING COLORADO NEW MEXICO SOUTH DAKOTA NEBRASKA KANSAS OKLAHOMA MINNESOTA IOWA MISSOURI ARKANSAS WISCONSIN ILLINOIS MISSISSIPPI MICHIGAN a n L a k e M i c h i g INDIANA L a k e H u r o n KENTUCKY TENNESSEE ALABAMA OHIO L a k e E r i e GEORGIA L a k e O n t a r i o NEW YORK PENNSYLVANIA WEST VIRGINIA VIRGINIA NORTH CAROLINA SOUTH CAROLINA DC DELAWARE MARYLAND A T L A VERMONT NEW HAMPSHIRE CONNECTICUT NEW JERSEY MASSACHUSETTS RHODE ISLAND E A N N T I C O C TEXAS HAWAII A R C T I C O C E A N LOUISIANA R U S S I A P A C I F I C O C E A N 0 100 mi 0 100 km ALASKA CANADA ME G U L F O F M E X I C O FLORIDA BERING SEA XIC THE BAHAMAS GULF OF ALASKA O 0 100 200 0 100 200 300 km 300 mi P A C I F I C O C E A N 0 0 200 mi 200 km CU BA U.S. Department of the Interior U.S. Geological Survey O The National Atlas of the United States of America R states2.pdf INTERIOR-GEOLOGICAL SURVEY, RESTON, VIRGINIA-2003
Collaborators Nandita Mitra, PhD Department of Biostatistics and Epidemiology University of Pennsylvania Thomas P Cappola, MD Penn Cardiovascular Institute University of Pennsylvania Thomas Lumley, PhD Department of Statistics University of Auckland B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 3 / 31
Goals To integrate haplotypes into large association studies such that haplotype imputation is done once as a data-processing step Case-control studies (binary outcome) [French et al., 2006] Prospective studies (censored survival outcome) [French et al., 2012] To allow haplotype associations to be estimated in general-purpose statistical software (eg R) by researchers expert in the subject matter In world historical terms there is a lot to be said for keeping data analysis out of the hands of statisticians Thomas Lumley B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 4 / 31
Goals To integrate haplotypes into large association studies such that haplotype imputation is done once as a data-processing step Case-control studies (binary outcome) [French et al., 2006] Prospective studies (censored survival outcome) [French et al., 2012] To allow haplotype associations to be estimated in general-purpose statistical software (eg R) by researchers expert in the subject matter In world historical terms there is a lot to be said for keeping data analysis out of the hands of statisticians Thomas Lumley B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 4 / 31
Phase ambiguity Observed data is composed of a set of unphased genotypes Diplotype (pair of haplotypes) may be ambiguous; may not know which allele was transmitted from maternal or paternal chromosome Missing data problem; impute the unobserved diplotype Father A T Mother C G Child A C T G AT/CG or AG/CT B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 5 / 31
Haplotype imputation Expectation-maximization (EM) algorithm E: calculate expected phase given haplotype frequencies M: calculate MLEs for haplotype frequencies given phase Software: haplo.stats [Sinnwell and Schaid, 2012] Bayesian algorithm Observed genotype data combined with expected haplotype patterns Haplotypes estimated from posterior distribution Software: PHASE [Stephens and Donnelly, 2003] B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 6 / 31
Diplotype uncertainty Angiotensin II receptor type 1 (AGTR1) Haplotype Diplotype Label Haplotype frequency probability D TCCACGCATCTT 0.139 0.81 F TCTGTGCATCTC 0.290 C TCCACGCATCTC 0.034 0.19 G TCTGTGCATCTT 0.272 Rare TCCGCGCATCTC < 0.001 < 0.01 Rare TCTATGCATCTT < 0.001 B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 7 / 31
Target of inference If diplotypes were known, could fit a standard regression model including diplotypes D i and environmental exposures Z i (possibly with interaction) Logistic regression: Y i = {1 = case, 0 = control} logit P[Y i = 1] = α + D i β D + Z i β Z + Cox regression: Y i = min(t i, C i ); event indicator δ i = 1[Y i = T i ] log λ i (t) = log λ 0 (t) + D i β D + Z i β Z + in which β = {β D, β Z } represent association parameters of interest We integrate the set of imputed haplotypes into these regression models B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 8 / 31
Estimating associations Non-iterative weighted estimation [French et al., 2006] 1. Impute haplotypes and estimate population haplotype frequencies 2. Create multi-record data for each individual Design matrix: set of diplotypes consistent with observed genotype, possibly including environmental exposures Weights equal to conditional probability of each diplotype Weight A B C D E F G H I Rare 0.81 0 0 0 1 0 1 0 0 0 0 0.19 0 0 1 0 0 0 1 0 0 0 0.01 0 0 0 0 0 0 0 0 0 2 3. Estimate associations using a weighted regression model Logistic regression for binary outcomes Cox regression for censored survival outcomes Robust or sandwich standard error estimator Account for uncertainty in phase B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 9 / 31
Estimating associations Non-iterative weighted estimation [French et al., 2006] 1. Impute haplotypes and estimate population haplotype frequencies 2. Create multi-record data for each individual Design matrix: set of diplotypes consistent with observed genotype, possibly including environmental exposures Weights equal to conditional probability of each diplotype Weight A B C D E F G H I Rare 0.81 0 0 0 1 0 1 0 0 0 0 0.19 0 0 1 0 0 0 1 0 0 0 0.01 0 0 0 0 0 0 0 0 0 2 3. Estimate associations using a weighted regression model Logistic regression for binary outcomes Cox regression for censored survival outcomes Robust or sandwich standard error estimator Account for uncertainty in phase B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 9 / 31
Estimating associations Non-iterative weighted estimation Estimates are obtained from estimating functions Logistic regression for binary outcomes n U(β) = π id (p id, β)x id [Y id expit X id β] i=1 d d(g i ) Cox regression for censored survival outcomes n U(β) = i=1 d d(g i ) l id (β) β l id (β) = δ id π id (p id, β) X id β log exp X i dβ i R(t i ) Inference based on univariable or multivariable Wald tests Estimator is valid and efficient under the null hypothesis B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 10 / 31
Estimating associations Weighted haplotype combination [Rebbeck et al., 2009] 1. Impute haplotypes and estimate population haplotype frequencies 2. Create single-record data for each individual Design matrix: weighted combination of diplotypes consistent with observed genotype, with weights equal to conditional probability of each diplotype, possibly including environmental exposures A B C D E F G H I Rare 0 0 0.19 0.81 0 0.81 0.19 0 0 0.01 3. Estimate associations using an unweighted regression model Logistic regression for binary outcomes Cox regression for censored survival outcomes Account for uncertainty in phase B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 11 / 31
Estimating associations Weighted haplotype combination [Rebbeck et al., 2009] 1. Impute haplotypes and estimate population haplotype frequencies 2. Create single-record data for each individual Design matrix: weighted combination of diplotypes consistent with observed genotype, with weights equal to conditional probability of each diplotype, possibly including environmental exposures A B C D E F G H I Rare 0 0 0.19 0.81 0 0.81 0.19 0 0 0.01 3. Estimate associations using an unweighted regression model Logistic regression for binary outcomes Cox regression for censored survival outcomes Account for uncertainty in phase B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 11 / 31
Estimating associations Weighted haplotype combination Estimates are obtained from estimating functions Logistic regression for binary outcomes U(β) = n X i [Y i expit X i β] i=1 Cox regression for censored survival outcomes U(β) = n i=1 l i (β) β l i (β) = δ i X i β log i R(t i ) exp X i β Inference based on univariable or multivariable Wald tests Estimator is exactly valid under the null hypothesis B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 12 / 31
Simulation study In which situations do non-iterative methods provide reliable results? Angiotensin II receptor type 1 (AGTR1) Label Haplotype Frequency A ATTATGCATCTC 0.029 B ATTATGTGATCC 0.051 C TCCACGCATCTC 0.027 D TCCACGCATCTT 0.090 E TCTGTGCAACTT 0.029 F* TCTGTGCATCTC 0.223 G TCTGTGCATCTT 0.188 H TTTACACATCTC 0.038 I TTTACACATCTT 0.032 * Referent n = 500, 25% censoring (independent) B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 13 / 31
Simulation results: Moderate effects Approximate bias 2 1 0 1 2 Weighted estimation Weighted combination Known phase A 1/2.0 B 1/1.75 C 1.25 D 1.5 E 1/1.25 G 1/1.5 H 1.75 I 2.0 Hazard ratio Haplotype B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 14 / 31
Simulation results: Moderate effects % coverage of 95% confidence intervals Hazard Weighted Weighted Known Haplotype ratio estimation combination phase A 1/2.0 94 95 95 B 1/1.75 94 95 95 C 1.25 80 95 95 D 1.5 93 94 95 E 1/1.25 93 95 94 G 1/1.5 89 94 95 H 1.75 86 96 95 I 2.0 92 95 96 B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 15 / 31
Simulation results: Moderate effects Rejection rate (%) of two-sided hypothesis tests Hazard Weighted Weighted Known Haplotype ratio estimation combination phase A 1/2.0 90 91 92 B 1/1.75 91 93 94 C 1.25 62 19 24 D 1.5 92 90 93 E 1/1.25 15 15 18 G 1/1.5 95 96 98 H 1.75 99 82 96 I 2.0 96 93 98 B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 16 / 31
Simulation results: Strong SNP effects Approximate bias 2 1 0 1 2 Weighted estimation Weighted combination Known phase A 1.0 B 1.0 C 1.0 D 4.0 E 4.0 G 4.0 H 1.0 I 4.0 Hazard ratio Haplotype B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 17 / 31
Simulation results: Strong SNP effects % coverage of 95% confidence intervals Hazard Weighted Weighted Known Haplotype ratio estimation combination phase A 1.0 93 94 94 B 1.0 94 94 94 C 1.0 92 93 95 D 4.0 94 95 95 E 4.0 92 94 95 G 4.0 94 94 95 H 1.0 94 94 95 I 4.0 92 93 92 B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 18 / 31
Simulation results: Strong SNP effects Rejection rate (%) of two-sided hypothesis tests Hazard Weighted Weighted Known Haplotype ratio estimation combination phase A 1.0 7 6 6 B 1.0 6 6 6 C 1.0 8 7 5 D 4.0 100 100 100 E 4.0 100 100 100 G 4.0 100 100 100 H 1.0 6 6 5 I 4.0 100 100 100 B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 19 / 31
Simulation results: Strong non-snp effects Approximate bias 2 1 0 1 2 Weighted estimation Weighted combination Known phase A 1.0 B 1.0 C 4.0 D 1.0 E 4.0 G 4.0 H 4.0 I 1.0 Hazard ratio Haplotype B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 20 / 31
Simulation results: Strong non-snp effects % coverage of 95% confidence intervals Hazard Weighted Weighted Known Haplotype ratio estimation combination phase A 1.0 93 95 95 B 1.0 90 95 95 C 4.0 0 96 94 D 1.0 94 95 94 E 4.0 65 96 95 G 4.0 0 91 95 H 4.0 0 88 94 I 1.0 81 89 95 B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 21 / 31
Simulation results: Strong non-snp effects Rejection rate (%) of two-sided hypothesis tests Hazard Weighted Weighted Known Haplotype ratio estimation combination phase A 1.0 7 5 5 B 1.0 10 5 5 C 4.0 37 100 100 D 1.0 6 5 6 E 4.0 100 100 100 G 4.0 100 100 100 H 4.0 83 100 100 I 1.0 19 11 5 B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 22 / 31
Application HSPB7-CLCNKA haplotypes and adverse events in chronic heart failure Regulate renal potassium channels to control blood pressure SNP associated with heart failure in a large case-control study [Cappola et al., 2011] Genotypes available for 1150 genetically inferred Caucasians with heart failure enrolled in a prospective study 70% male; median age at study entry, 58 years 14 pre-selected SNPs Inferred 10 common haplotypes (frequency > 0.02) 65% had an unambiguous diplotype 90% had a highest posterior probability > 0.765 B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 23 / 31
Application HSPB7-CLCNKA haplotypes and adverse events in chronic heart failure Outcome: time to all-cause mortality or cardiac transplantation Median follow-up, 3 years; maximum, 5 years 22% experienced an adverse event Cox regression framework Stratified by 4-level classification for disease severity Adjusted for gender, age, heart failure etiology, clinical site Time-varying covariate for age (exhibited non-proportional hazards) B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 24 / 31
Application results: Weighted estimation Label Haplotype Frequency HR (95% CI) P Q AGAGCGAGACGAGG 0.036 1.19 (0.80, 1.77) 0.39 R AGAGCGAGGGAAGG 0.160 1.04 (0.80, 1.35) 0.79 S AGAGCGGAGCAAGA 0.036 1.20 (0.80, 1.77) 0.38 T AGCGAGAGGCAAGA 0.066 0.55 (0.34, 0.88) 0.01 U GACGCGGAGCGCGG 0.063 0.80 (0.53, 1.20) 0.28 V GGAACAAGGGAAGG 0.037 0.49 (0.26, 0.92) 0.03 W GGAACAGAGCAAGA 0.299 Referent X GGAACAGAGCAAGG 0.048 1.36 (0.95, 1.95) 0.09 Y GGAGCAAGGCAAGG 0.050 1.15 (0.78, 1.69) 0.49 Z GGCGCGGAGCAAGG 0.031 1.09 (0.62, 1.92) 0.76 Overall 0.02 B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 25 / 31
Application results: Weighted estimation Label Haplotype Frequency HR (95% CI) P Q AGAGCGAGACGAGG 0.036 1.19 (0.80, 1.77) 0.39 R AGAGCGAGGGAAGG 0.160 1.04 (0.80, 1.35) 0.79 S AGAGCGGAGCAAGA 0.036 1.20 (0.80, 1.77) 0.38 T AGCGAGAGGCAAGA 0.066 0.55 (0.34, 0.88) 0.01 U GACGCGGAGCGCGG 0.063 0.80 (0.53, 1.20) 0.28 V GGAACAAGGGAAGG 0.037 0.49 (0.26, 0.92) 0.03 W GGAACAGAGCAAGA 0.299 Referent X GGAACAGAGCAAGG 0.048 1.36 (0.95, 1.95) 0.09 Y GGAGCAAGGCAAGG 0.050 1.15 (0.78, 1.69) 0.49 Z GGCGCGGAGCAAGG 0.031 1.09 (0.62, 1.92) 0.76 Overall 0.02 B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 25 / 31
Application results: Weighted combination Label Haplotype Frequency HR (95% CI) P Q AGAGCGAGACGAGG 0.036 1.19 (0.76, 1.87) 0.44 R AGAGCGAGGGAAGG 0.160 1.05 (0.81, 1.37) 0.73 S AGAGCGGAGCAAGA 0.036 1.20 (0.74, 1.94) 0.47 T AGCGAGAGGCAAGA 0.066 0.53 (0.32, 0.90) 0.02 U GACGCGGAGCGCGG 0.063 0.80 (0.52, 1.21) 0.29 V GGAACAAGGGAAGG 0.037 0.43 (0.21, 0.88) 0.02 W GGAACAGAGCAAGA 0.299 Referent X GGAACAGAGCAAGG 0.048 1.44 (0.95, 2.19) 0.09 Y GGAGCAAGGCAAGG 0.050 1.15 (0.77, 1.72) 0.49 Z GGCGCGGAGCAAGG 0.031 1.10 (0.64, 1.87) 0.73 Overall 0.03 B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 26 / 31
Application results: Weighted combination Label Haplotype Frequency HR (95% CI) P Q AGAGCGAGACGAGG 0.036 1.19 (0.76, 1.87) 0.44 R AGAGCGAGGGAAGG 0.160 1.05 (0.81, 1.37) 0.73 S AGAGCGGAGCAAGA 0.036 1.20 (0.74, 1.94) 0.47 T AGCGAGAGGCAAGA 0.066 0.53 (0.32, 0.90) 0.02 U GACGCGGAGCGCGG 0.063 0.80 (0.52, 1.21) 0.29 V GGAACAAGGGAAGG 0.037 0.43 (0.21, 0.88) 0.02 W GGAACAGAGCAAGA 0.299 Referent X GGAACAGAGCAAGG 0.048 1.44 (0.95, 2.19) 0.09 Y GGAGCAAGGCAAGG 0.050 1.15 (0.77, 1.72) 0.49 Z GGCGCGGAGCAAGG 0.031 1.10 (0.64, 1.87) 0.73 Overall 0.03 B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 26 / 31
R packages: Non-iterative weighted estimation haplo.ccs [French and Lumley, 2011] Weighted logistic regression for binary outcomes Depends on haplo.stats package to impute haplotypes Calls glm(..., family=quasibinomial(link=logit)) Includes GEE-type sandwich standard error estimator haplo.cph (in process) Weighted Cox regression for censored survival outcomes Will depend on haplo.stats package to impute haplotypes Will call cph(..., robust=true) from Design package Allow stratification and time-varying exposures B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 27 / 31
Conclusions Non-iterative estimation methods Incorporate all diplotypes consistent with the observed genotype, either weighted (weighted estimation) or averaged (weighted combination) by the conditional probability of each diplotype Provide valid tests for genetic associations and reliable estimates of modest genetic effects of common haplotypes Convenient regression framework Adjustment for or interaction with environmental exposures Stratification and time-varying exposures in Cox regression Straightforward to implement in R haplo.ccs for binary outcomes haplo.cph for censored survival outcomes See [French et al., 2006; 2012] for implementation in Stata B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 28 / 31
Target High Mostly known May not exist Relative risk Low Probably can t be found Our focus Rare Common Haplotype frequency B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 29 / 31
Limitations Our methods and/or software may not be applicable to Related individuals Rare haplotypes Small studies Longitudinal outcomes B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 30 / 31
References dbe.med.upenn.edu/biostat-research/bcfrench 1. Cappola TP, Matkovich SJ, et al. 2011. Loss-of-function DNA sequence variant in the CLCNKA chloride channel implicates the cardio-renal axis in interindividual heart failure risk variation. Proc Natl Acad Sci U S A 108:2456 61. 2. French B, Lumley T. 2011. haplo.ccs: Estimate haplotype relative risks in case-control data. R package 1.3.1. 3. French B, Lumley T, et al. 2006. Simple estimates of haplotype relative risks in case-control data. Genet Epidemiol 30:485 94. 4. French B, Lumley T, et al. 2012. Non-iterative, regression-based estimation of haplotype associations with censored survival outcomes. Stat App Genet Mol Biol 11:4. 5. Rebbeck TR, Mitra N, et al. 2009. Modification of ovarian cancer risk by BRAC1/2-interacting genes in a multi center cohort of BRAC1/2 mutation carriers. Cancer Res 71:5792 805. 6. Sinnwell JP, Schaid DJ. 2012. haplo.stats: Statistical analysis of haplotypes with traits and covariates when linkage phase is ambiguous. R package version 1.5.5. 7. Stephens M, Donnelly P. 2003. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73:1162 69. B French (bcfrench@upenn.edu) Estimating haplotype associations September 2012 31 / 31