Common Variants near MBNL1 and NKX2-5 are Associated with Infantile Hypertrophic Pyloric Stenosis

Supplementary Information: Common Variants near MBNL1 and NKX2-5 are Associated with Infantile Hypertrophic Pyloric Stenosis

Bjarke Feenstra 1*, Frank Geller 1*, Camilla Krogh 1, Mads V. Hollegaard 2, Sanne Gørtz 1, Heather A. Boyd 1, Jeffrey C. Murray 3, David M. Hougaard 2, Mads Melbye 1

1 Department of Epidemiology Research, Statens Serum Institut, Copenhagen, Denmark
2 Section of Neonatal Screening and Hormones, Statens Serum Institut, Copenhagen, Denmark
3 Department of Pediatrics, University of Iowa, Iowa City, IA, USA
* These authors contributed equally.

Correspondence should be addressed to B.F. (fee@ssi.dk) or M.M. (mme@ssi.dk).

Contents

Supplementary Note
Subjects; Sampling, Amplification and Genotyping; Population Stratification Analysis; Imputation; Analysis of Long-Range Haplotypes in the Chromosome 3 Region.

Supplementary Tables
Supplementary Table 1. Phenotype definition.
Supplementary Table 2. Basic characteristics of the discovery and replication samples.
Supplementary Table 3. Discovery, replication and combined results for one additional SNP at each of the six top loci associated with IHPS.
Supplementary Table 4. Results of the loci associated with IHPS in the discovery phase, without and with correction for possible population stratification.
Supplementary Table 5. Association results for 294 imputed SNPs with P < 10^-6 across the 3q25.1, 3q25.2, and 5q35.2 loci.
Supplementary Table 6. Long-range haplotypes for chromosome 3.
Supplementary Table 7. Association results for IHPS stratified by sex.

Supplementary Figures
Supplementary Figure 1. Quantile-quantile plot.
Supplementary Figure 2. Regional association plot showing imputed SNP results for the first confirmed locus on chromosome 3q25.1.
Supplementary Figure 3. Regional association plot showing imputed SNP results for the second confirmed locus on chromosome 3q25.2.
Supplementary Figure 4. Regional association plot showing imputed SNP results for the third confirmed locus on chromosome 5q35.2.

References

Supplementary Note

Subjects

IHPS cases were identified from the Danish National Patient Register, which covers all hospital discharge diagnoses and operations performed since 1977 [1]. Eligible cases were defined as children who had a pyloromyotomy in their first year of life, according to the Danish Classification of Surgical Procedures codes up to December 1995 (International Statistical Classification of Diseases, Eighth Revision (ICD-8) codes 41840, 41841, 44100) and the Nordic Classification of Surgical Procedures codes after January 1996 (ICD-10 codes KJDH60, KJDH61). The total number of cases born between 1977 and 2008 and operated on within the first year was 3,366 (more than 90% of these were operated on within the first two months).

We excluded multiple births and children with major malformations at birth according to the EUROCAT classification. We also excluded the following pregnancy complications: placenta praevia, placental abruption, placental insufficiency, hydramnios, isoimmunization, and preeclampsia. To ensure a high degree of genetic homogeneity in the genotyped sample, we obtained birthplace information from the Danish Civil Registry [2] and included only cases who, like both of their parents, were born in Scandinavia and none of whose grandparents were born outside Northwestern Europe.

A total of 1,048 case samples were selected for genome-wide SNP genotyping, prioritizing the most recent birth years. Of these, 46 were dropped after pre-testing, 1,002 were submitted for genome-wide genotyping, and 1,001 were successfully genotyped. The control group consisted of 2,401 non-affected children from our ongoing GWAS of preterm delivery [3], with the same exclusion criteria applied. For replication, we had 796 cases and 876 controls, using the same case and control definitions as in the discovery stage.

Sampling, Amplification and Genotyping

All samples were retrieved from the Danish Newborn Screening Biobank or from the biobank of the Danish National Birth Cohort, both of which are part of the Danish National Biobank. Sampling and genotyping of the discovery stage subjects were undertaken in two rounds. In the first round, 1,901 children were sampled from buffy coat (1,440) or dried blood spot samples (461) as part of our preterm delivery GWAS. DNA extraction and genotyping were performed at the Johns Hopkins University Center for Inherited Disease Research (CIDR). For 55 samples, whole-genome amplification was performed because of low yields of genomic DNA. GWAS genotyping was done with the Illumina Human 660W-Quadv1_A chip (Illumina, San Diego, CA, USA). In the second round, all IHPS cases as well as 500 extra samples from the preterm delivery GWAS were sampled using two 3 mm punches from dried blood spot samples. For these samples, DNA was extracted and whole-genome amplified at Statens Serum Institut using a protocol optimized for dried blood spot samples, as previously described [4]. GWAS genotyping was again performed at CIDR using the Illumina Human 660W-Quadv1_A chip.

For replication, we sampled 796 cases and 876 controls using punches of dried blood spot samples. DNA was extracted and whole-genome amplified at Statens Serum Institut. Genotyping of two correlated SNPs at each of the 6 most significantly associated loci was performed at deCODE Genetics using the Centaurus platform (Nanogen, Bothell, WA, USA).

Population Stratification Analysis

To investigate the effects of possible population substructure, we performed multidimensional scaling analysis (as implemented in PLINK [5]) on the discovery data, using independent autosomal SNPs with missing call rates < 1%, minor allele frequency > 5%, and Hardy-Weinberg P value > […]. We used PLINK's LD pruning function to remove short- and long-range LD. The resulting 22,766 SNPs were analyzed along with founder genotypes from 11 HapMap phase III reference populations. We included the first five dimensions as covariates in a logistic regression model and repeated the association analysis of the discovery data. There was no evidence that population substructure played any significant role: all five dimensions had P values > 0.05 in the model, the genomic inflation factor was unchanged at 1.06, and the association results for the top loci were essentially the same (Supplementary Table 4).

Imputation

To explore the association signals further, we imputed unobserved genotypes in the three confirmed regions using phased haplotypes from 381 unrelated individuals from five populations of European ancestry, obtained from the Interim Phase I release of the 1000 Genomes Project [6]. We imputed 1.5 Mb to either side of each of the three top SNPs, so the entire region between the two chromosome 3 SNPs was included. Imputation was done in a two-step procedure. In a first, prephasing step, we used MaCH [7] to estimate haplotypes for the IHPS study samples. In a second step, we imputed missing alleles for additional SNPs directly onto these phased haplotypes using Minimac [7].

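The choice of imputation regions described above (a 1.5 Mb flank on each side of each top SNP, with overlapping windows on the same chromosome fused into one region) can be sketched as follows. This is only an illustration: the function name `imputation_regions` is hypothetical and the positions in the example are invented, since the source does not preserve the SNP coordinates.

```python
from typing import List, Tuple

FLANK_BP = 1_500_000  # 1.5 Mb to either side of each top SNP (value from the text)

Interval = Tuple[str, int, int]  # (chromosome, start_bp, end_bp)


def imputation_regions(top_snps: List[Tuple[str, int]],
                       flank: int = FLANK_BP) -> List[Interval]:
    """Build a flanking window around each SNP and merge overlapping windows."""
    windows = sorted(
        (chrom, max(1, pos - flank), pos + flank) for chrom, pos in top_snps
    )
    merged: List[Interval] = []
    for chrom, start, end in windows:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Same chromosome and overlapping: extend the previous region.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical positions, chosen only to show two nearby SNPs sharing one region.
example_snps = [("3", 10_000_000), ("3", 12_000_000), ("5", 50_000_000)]
regions = imputation_regions(example_snps)
for chrom, start, end in regions:
    print(f"chr{chrom}:{start}-{end}")
```

With two SNPs less than 3 Mb apart on the same chromosome, the two windows overlap and collapse into a single region, which is why the whole stretch between the two chromosome 3 SNPs ends up imputed.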
All imputed SNPs with imputation quality r² > 0.30 were tested for association with IHPS in a logistic regression of disease status on imputed allele dosage (to account for imputation uncertainty) using mach2dat [7]. In addition, we carried out conditional analyses including the genotype of the confirmed genotyped SNP as a covariate in the logistic regression model. We estimated the correlation between the top genotyped SNP and the imputed SNPs by the squared Pearson correlation coefficient between the allele count of the genotyped SNP and the allele dosages of the imputed SNPs. Imputation results are shown in Supplementary Table 5 and Supplementary Figures 2-4.

Analysis of Long-Range Haplotypes in the Chromosome 3 Region

To investigate whether long-range haplotypes could explain the seemingly independent associations with IHPS observed for the SNPs rs[…] and rs573872, we analyzed the region spanning from the start of the LD block harboring rs[…] to the end of the block containing rs[…] ([…] Mb). In line with a previous study [8], we aimed at long-range haplotypes in cases and therefore selected additional tag SNPs based on an association P value < 0.01 and a ratio of D′ in cases to D′ in controls > 1.2 (with respect to either rs[…] or rs573872). This led to the construction of haplotypes based on 12 SNPs with fastPHASE [9]. A total of 47 haplotypes had a frequency estimate > 0.5% in either cases or controls (Supplementary Table 6). Odds ratio estimates ranged from 0.4 to 2.2 and were well in line with what could be expected based on the included alleles of rs[…] and rs[…].
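The haplotype summary above (keep haplotypes with an estimated frequency above 0.5% in cases or controls, then report an odds ratio for each) can be illustrated with a small sketch. It assumes each odds ratio compares one haplotype against all others using only the two frequency estimates; the source does not say how the odds ratios were actually computed, and the haplotype labels and frequencies below are invented.

```python
from typing import Dict, Tuple

MIN_FREQ = 0.005  # frequency estimate > 0.5% in either cases or controls (from the text)


def haplotype_odds_ratio(freq_cases: float, freq_controls: float) -> float:
    """Odds ratio of one haplotype versus all others (assumed definition)."""
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    if odds_controls == 0.0:
        return float("inf")  # haplotype not observed in controls
    return odds_cases / odds_controls


def common_haplotype_ors(freqs: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Keep haplotypes passing the frequency filter and compute their odds ratios."""
    return {
        hap: haplotype_odds_ratio(f_cases, f_controls)
        for hap, (f_cases, f_controls) in freqs.items()
        if f_cases > MIN_FREQ or f_controls > MIN_FREQ
    }


# Hypothetical 12-SNP haplotypes coded with 0/1 alleles: (freq in cases, freq in controls).
example_freqs = {
    "000000000000": (0.20, 0.10),
    "111111111111": (0.004, 0.003),  # below 0.5% in both groups, so dropped
}
ors = common_haplotype_ors(example_freqs)
```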

Supplementary Table 1. Phenotype definition.

Inclusion criterion for cases:
Confirmed pyloromyotomy within the first year of life

Exclusion criteria for cases and controls:
Multiple births
Major malformations (according to the EUROCAT classification)
Pregnancy complications (placenta praevia, placental abruption, placental insufficiency, hydramnios, isoimmunization, and preeclampsia)
Birthplace outside of Scandinavia, parents' birthplace outside of Scandinavia, grandparents' birthplace outside of Northwestern Europe

Supplementary Table 2. Basic characteristics of the discovery and replication samples.

Study group          N        % boys   Mean year of birth (SD) a
Discovery stage      3,[…]    […]%     1999 (4.9)
  cases              1,[…]    […]%     1996 (6.5)
  controls           2,[…]    […]%     2001 (3.0)
Replication stage    1,[…]    […]%     1995 (7.2)
  cases              […]      […]%     1990 (7.5)
  controls           […]      […]%     2000 (1.3)

a The replication cases were born an average of 6 years earlier than the discovery cases. This was by design: because blood spot samples stored for many years might give a lower yield in GWAS genotyping than those stored for fewer years, we prioritized the most recently born cases for the discovery stage.
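The phenotype definition in Supplementary Table 1 amounts to one inclusion test plus a shared set of exclusions. A minimal sketch of that logic is given here; the `Child` record and its field names are hypothetical and do not correspond to any actual register variables.

```python
from dataclasses import dataclass


@dataclass
class Child:
    # All field names are illustrative, not real register variables.
    pyloromyotomy_in_first_year: bool
    multiple_birth: bool
    major_malformation: bool          # per the EUROCAT classification
    pregnancy_complication: bool      # any of the listed complications
    born_in_scandinavia: bool
    parents_born_in_scandinavia: bool
    grandparents_born_in_nw_europe: bool


def passes_exclusions(child: Child) -> bool:
    """Exclusion criteria shared by cases and controls."""
    return (
        not child.multiple_birth
        and not child.major_malformation
        and not child.pregnancy_complication
        and child.born_in_scandinavia
        and child.parents_born_in_scandinavia
        and child.grandparents_born_in_nw_europe
    )


def is_eligible_case(child: Child) -> bool:
    return child.pyloromyotomy_in_first_year and passes_exclusions(child)


def is_eligible_control(child: Child) -> bool:
    return not child.pyloromyotomy_in_first_year and passes_exclusions(child)
```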

Supplementary Table 3. Discovery, replication and combined results for one additional SNP at each of the six top loci associated with IHPS. These additional SNPs were runners-up in the discovery analysis and show high correlation to the primary SNPs in the combined discovery and replication sample.

Columns: chr; SNP; r² (a); position (bp); effect (Eff) and alternative (Alt) alleles; for each of the discovery and replication stages, effect allele frequency and number of cases and controls, odds ratio (95% CI), and P value; for the combined analysis, number of cases and controls, odds ratio (95% CI), P value, and Het P.

chr   SNP        Eff  Alt  Discovery P  Replication P  Combined P
3     rs[…]      G    A    1.8e[…]      4.7e[…]        7.2e[…]
[…]   rs[…]      T    C    4.3e[…]      5.6e[…]        2.1e[…]
[…]   rs[…]      G    A    1.3e[…]      4.7e[…]        3.6e[…]
[…]   rs[…] (b)  G    A    1.2e[…]      […]            5.8e[…]
[…]   rs[…]      A    G    2.6e[…]      […]            2.1e[…]
[…]   rs[…]      G    A    1.7e[…]      […]            1.4e[…]

r², r² to top SNP at locus; Eff, effect allele; Alt, alternative allele; Frequency, effect allele frequency; CI, confidence interval; Het P, P value for test of heterogeneity using the I² statistic.
a Correlation coefficients (r²) to the top SNP at each locus are based on the combined discovery and replication sample, apart from rs[…], where the coefficient is based on data from the HapMap CEU sample.
b rs[…] was not genotyped in the discovery stage; instead, discovery stage results for the (perfectly correlated) top SNP at the locus, rs[…], are used in the table.

Supplementary Table 4. Results of the loci associated with IHPS in the discovery phase, without and with correction for possible population stratification. Correction for population stratification was performed by including the first five dimensions from a multidimensional scaling analysis as covariates in a logistic regression of IHPS disease status on SNP allele count.

Columns: chr; SNP; position (bp); effect (Eff) and alternative (Alt) alleles; effect allele frequency and number of cases and controls; odds ratio (95% CI) and P value for the discovery analysis; odds ratio (95% CI) and P value for the discovery analysis with five MDS dimensions.

chr   SNP     Eff  Alt  Discovery P  P with five MDS dimensions
3     rs[…]   A    G    5.5e[…]      1.6e-12
3     rs[…]   G    T    3.9e[…]      5.3e-07
5     rs[…]   A    G    5.7e[…]      1.1e-10
6     rs[…]   C    T    1.2e[…]      1.6e[…]
[…]   rs[…]   G    T    8.5e[…]      1.3e[…]
[…]   rs[…]   T    C    4.1e[…]      6.3e-07

MDS, multidimensional scaling; Eff, effect allele; Alt, alternative allele; Frequency, effect allele frequency; CI, confidence interval.
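The odds ratio, 95% confidence interval, and P value columns in a table like this are typically derived from a logistic regression coefficient and its standard error. The sketch below shows that conversion under the assumption of a two-sided Wald test; the source does not state which test produced the reported P values, and the numbers in the example are invented.

```python
import math
from typing import Tuple

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def wald_summary(beta: float, se: float) -> Tuple[float, float, float, float]:
    """Convert a log-odds coefficient and its standard error into
    (odds ratio, CI lower bound, CI upper bound, two-sided Wald P value)."""
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - Z_95 * se)
    upper = math.exp(beta + Z_95 * se)
    z = beta / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return odds_ratio, lower, upper, p_value


# Hypothetical per-allele coefficient and standard error.
or_, lo, hi, p = wald_summary(beta=math.log(1.5), se=0.1)
print(f"OR {or_:.2f} (95% CI {lo:.2f}-{hi:.2f}), P = {p:.1e}")
```

Adding covariates such as MDS dimensions changes how `beta` and `se` are estimated, not how they are converted into the reported columns.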

Supplementary Table 5. Association results for 294 imputed SNPs with P < 10^-6 across the 3q25.1, 3q25.2, and 5q35.2 loci. The table is sorted by base-pair position (NCBI build 37) and shows effect (Eff) and alternative (Alt) allele; effect allele frequency; r² value from MaCH indicating imputation quality; odds ratio for association with IHPS; P value (after genomic control, λ = 1.06 from discovery scan); indicator of whether the SNP was genotyped; squared Pearson correlation coefficient (r²) of imputed SNP allele dosage to allele count for the top genotyped SNP at the locus (rs[…], rs573872, or rs29784); function class and gene name for SNPs in genes. Results for the three loci are separated by bold lines.

Columns: chr; SNP; position (bp); alleles (Eff, Alt); effect allele frequency; imputation r²; odds ratio; P value; genotyped; r²; class; gene.

[Table body not recoverable: SNP identifiers, positions, and numeric values are missing. Surviving annotations, all on chromosome 5: upstream-variant-2kb and intron-variant entries in C5orf41 and in BNIP1.]
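The r² column described in the caption of Supplementary Table 5 is the squared Pearson correlation between the allele count (0, 1, or 2) of the top genotyped SNP and the imputed allele dosage (a value between 0 and 2) of another SNP. A minimal sketch of that calculation follows; the function name `squared_pearson` and the example data are invented for illustration.

```python
from typing import Sequence


def squared_pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation undefined for a constant vector")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical data for six individuals.
allele_count = [0, 1, 2, 1, 0, 2]                 # genotyped SNP
allele_dosage = [0.1, 0.9, 1.8, 1.2, 0.0, 2.0]    # imputed SNP
r2 = squared_pearson(allele_count, allele_dosage)
```

Using dosages rather than best-guess genotypes carries the imputation uncertainty into the correlation estimate.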

Supplementary Table 6. Long-range haplotypes for chromosome 3.

Columns: haplotype; frequency in cases (SE); frequency in controls (SE); OR.

[Table body not recoverable: haplotypes and numeric values are missing.]

Legend
Red haplotypes: both risk alleles (rs[…] a, rs[…] g)
Blue haplotypes: both non-risk alleles (rs[…] g, rs[…] t)
Bold haplotypes have a frequency > 2% in at least one group.

SNP positions and alleles

Position (bp)  SNP     0 allele  1 allele
[…]            rs[…]   A         G
[…]            rs[…]   A         G
[…]            rs[…]   A         G
[…]            rs[…]   G         A
[…]            rs[…]   T         C
[…]            rs[…]   G         T
[…]            rs[…]   A         G
[…]            rs[…]   G         A
[…]            rs[…]   T         C
[…]            rs[…]   G         A
[…]            rs[…]   C         T
[…]            rs[…]   G         T

Supplementary Table 7. Association results for IHPS stratified by sex. Results are given separately for the discovery data, the replication data, and the discovery and replication data combined by meta-analysis.

Columns: chr; SNP; position (bp); effect (Eff) and alternative (Alt) alleles; for boys and for girls separately, effect allele frequency and number of cases and controls, odds ratio (95% CI), and P value; Het P.

Stage        chr   SNP     Eff  Alt  Boys P       Girls P
Discovery    3     rs[…]   A    G    4.4e[…]      4.5e[…]
Discovery    […]   rs[…]   G    T    2.0e[…]      8.7e[…]
Discovery    […]   rs[…]   A    G    4.5e[…]      6.4e[…]
Discovery    […]   rs[…]   C    T    9.7e[…]      3.8e[…]
Discovery    […]   rs[…]   G    T    2.3e[…]      2.8e[…]
Discovery    […]   rs[…]   T    C    1.7e[…]      […]e-03
Replication  3     rs[…]   A    G    1.7e[…]      4.8e[…]
Replication  […]   rs[…]   G    T    1.6e[…]      1.1e[…]
Replication  […]   rs[…]   A    G    1.8e[…]      9.8e[…]
Replication  […]   rs[…]   C    T    […]          […]
Replication  […]   rs[…]   G    T    […]          […]
Replication  […]   rs[…]   T    C    […]          […]
Combined     3     rs[…]   A    G    2.5e[…]      1.8e[…]
Combined     […]   rs[…]   G    T    2.2e[…]      4.5e[…]
Combined     […]   rs[…]   A    G    2.2e[…]      2.1e[…]
Combined     […]   rs[…]   C    T    4.7e[…]      3.5e[…]
Combined     […]   rs[…]   G    T    1.6e[…]      3.5e-04 (b)
Combined     […]   rs[…]   T    C    7.6e-08 (a)  […]e-03

Eff, effect allele; Alt, alternative allele; Frequency, effect allele frequency; CI, confidence interval; Het P, P value for test of heterogeneity between results for boys and girls using the I² statistic.
a In the combined results for boys, there was significant heterogeneity between discovery and replication results for rs[…] (P = 0.004).
b In the combined results for girls, there was significant heterogeneity between discovery and replication results for rs[…] (P = 0.002).
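Combining two estimates by meta-analysis and testing them for heterogeneity, as in the boys versus girls comparison, can be sketched as below. The sketch assumes a fixed-effect inverse-variance meta-analysis of log odds ratios with Cochran's Q (1 degree of freedom for two groups) and I² derived from Q; the source names only the I² statistic, so the remaining details are assumptions, and the input values are invented.

```python
import math
from typing import Tuple


def two_group_meta(beta1: float, se1: float,
                   beta2: float, se2: float) -> Tuple[float, float, float, float, float]:
    """Fixed-effect inverse-variance combination of two log odds ratios.

    Returns (pooled beta, pooled SE, Cochran's Q, I^2, heterogeneity P value).
    """
    w1 = 1.0 / se1 ** 2
    w2 = 1.0 / se2 ** 2
    pooled = (w1 * beta1 + w2 * beta2) / (w1 + w2)
    pooled_se = math.sqrt(1.0 / (w1 + w2))
    q = w1 * (beta1 - pooled) ** 2 + w2 * (beta2 - pooled) ** 2
    df = 1  # number of groups minus one
    i2 = max(0.0, (q - df) / q) if q > 0.0 else 0.0
    # Upper tail of a chi-square distribution with 1 df: erfc(sqrt(q / 2)).
    p_het = math.erfc(math.sqrt(q / 2.0))
    return pooled, pooled_se, q, i2, p_het


# Hypothetical log odds ratios for boys and girls with equal standard errors.
pooled, pooled_se, q, i2, p_het = two_group_meta(0.2, 0.1, 0.6, 0.1)
print(f"pooled OR {math.exp(pooled):.2f}, Q = {q:.2f}, I2 = {i2:.0%}, Het P = {p_het:.3f}")
```

The same function applies to the discovery versus replication comparison mentioned in the table footnotes, since that is also a two-group contrast.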

Supplementary Figure 1. Quantile-quantile plot of observed versus expected -log10 P values from the genome-wide scan for IHPS after correction by genomic control (λ = 1.06). The expected distribution of -log10 P values under the null hypothesis is shown by the grey line.
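The quantities behind a plot like Supplementary Figure 1 can be sketched in a few lines. The code assumes 1-degree-of-freedom tests, estimates the genomic inflation factor λ as the median observed chi-square statistic divided by the median of the null distribution, and uses (i - 0.5)/n as the expected P value of the i-th smallest observation. These are common conventions, not details confirmed by the source.

```python
import math
from statistics import NormalDist, median
from typing import Iterable, List, Tuple

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 df (square of the 75th normal percentile).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2(p: float) -> float:
    """Chi-square (1 df) statistic corresponding to a two-sided P value in (0, 1]."""
    z = _STD_NORMAL.inv_cdf(p / 2.0)  # lower-tail quantile; sign is irrelevant once squared
    return z * z


def genomic_inflation(p_values: Iterable[float]) -> float:
    """Lambda: median observed chi-square over the median expected under the null."""
    return median(p_to_chi2(p) for p in p_values) / CHI2_1DF_MEDIAN


def gc_correct(p: float, lam: float) -> float:
    """P value after dividing the chi-square statistic by lambda."""
    return math.erfc(math.sqrt(p_to_chi2(p) / lam / 2.0))


def qq_points(p_values: List[float]) -> List[Tuple[float, float]]:
    """(expected, observed) -log10 P pairs, sorted from smallest to largest."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))
```

Under the null hypothesis the points fall along the diagonal and λ is close to 1; dividing the test statistics by λ pulls a mildly inflated distribution back toward that line.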

Supplementary Figure 2. Regional association plot showing imputed SNP results for the first confirmed locus on chromosome 3q25.1. SNPs were imputed using the Interim Phase I haplotype release (June 2011) of the European samples from the 1000 Genomes Project as reference. The figure shows results of: a) logistic regression of disease status on single SNP allele dosage, and b) conditional analysis including the genotype of rs[…] as a covariate in the logistic regression model. SNPs are plotted by chromosomal position (NCBI build 37) against -log10 P value of IHPS association. The colors reflect LD to the confirmed genotyped SNP at the locus, based on pairwise r² values from the 1000 Genomes Project (red: r² > 0.8; orange: 0.6 < r² < 0.8; green: 0.4 < r² < 0.6; light blue: 0.2 < r² < 0.4; purple: r² < 0.2). Estimated recombination rates (from HapMap) are plotted to reflect the local LD structure. Genes are indicated in the lower panel of the plot. The figure was generated using LocusZoom [10].
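The color coding in the caption maps each SNP's r² with the confirmed genotyped SNP to one of five bins. A small sketch of that mapping is shown here; the caption leaves the exact boundary values (0.2, 0.4, 0.6, 0.8) unassigned, so placing them in the lower bin is an assumption, and `ld_color` is a hypothetical helper rather than part of LocusZoom.

```python
def ld_color(r2: float) -> str:
    """Map pairwise r^2 to the color bins listed in the figure caption."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie between 0 and 1")
    if r2 > 0.8:
        return "red"
    if r2 > 0.6:
        return "orange"
    if r2 > 0.4:
        return "green"
    if r2 > 0.2:
        return "light blue"
    return "purple"  # includes the boundary value 0.2 by assumption


colors = [ld_color(v) for v in (0.9, 0.7, 0.5, 0.3, 0.1)]
```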

Supplementary Figure 3. Regional association plot showing imputed SNP results for the second confirmed locus on chromosome 3q25.2. SNPs were imputed using the Interim Phase I haplotype release (June 2011) of the European samples from the 1000 Genomes Project as reference. The figure shows results of: a) logistic regression of disease status on single SNP allele dosage, and b) conditional analysis including the genotype of rs[…] as a covariate in the logistic regression model. Axes and symbols are defined as in Supplementary Figure 2, and the same color coding is used to reflect LD to the confirmed genotyped SNP at the locus. The figure was generated using LocusZoom [10].

Supplementary Figure 4. Regional association plot showing imputed SNP results for the third confirmed locus on chromosome 5q35.2. SNPs were imputed using the Interim Phase I haplotype release (June 2011) of the European samples from the 1000 Genomes Project as reference. The figure shows results of: a) logistic regression of disease status on single SNP allele dosage, and b) conditional analysis including the genotype of rs29784 as a covariate in the logistic regression model. Axes and symbols are defined as in Supplementary Figure 2, and the same color coding is used to reflect LD to the confirmed genotyped SNP at the locus. The figure was generated using LocusZoom [10].

References

1. Andersen, T.F., Madsen, M., Jorgensen, J., Mellemkjoer, L. & Olsen, J.H. The Danish National Hospital Register. A valuable source of data for modern health sciences. Dan. Med. Bull. 46 (1999).
2. Pedersen, C.B., Gotzsche, H., Moller, J.O. & Mortensen, P.B. The Danish Civil Registration System. A cohort of eight million persons. Dan. Med. Bull. 53 (2006).
3. Cornelis, M.C. et al. The Gene, Environment Association Studies consortium (GENEVA): maximizing the knowledge obtained from GWAS by collaboration across studies of multiple conditions. Genet. Epidemiol. 34 (2010).
4. Hollegaard, M.V. et al. Genome-wide scans using archived neonatal dried blood spot samples. BMC Genomics 10, 297 (2009).
5. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81 (2007).
6. A map of human genome variation from population-scale sequencing. Nature 467 (2010).
7. Li, Y., Willer, C.J., Ding, J., Scheet, P. & Abecasis, G.R. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34 (2010).
8. Wang, K. et al. Interpretation of association signals and identification of causal variants from genome-wide association studies. Am. J. Hum. Genet. 86 (2010).
9. Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78 (2006).
10. Pruim, R.J. et al. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 26 (2010).
