Genotype Imputation. Biostatistics 666

Similar documents
Genotype Imputation. Class Discussion for January 19, 2016

Calculation of IBD probabilities

Calculation of IBD probabilities

The Lander-Green Algorithm. Biostatistics 666 Lecture 22

Variance Component Models for Quantitative Traits. Biostatistics 666

Affected Sibling Pairs. Biostatistics 666

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

Lecture 9. QTL Mapping 2: Outbred Populations

Linear Regression (1/1/17)

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

(Genome-wide) association analysis

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

Gene mapping in model organisms

Department of Forensic Psychiatry, School of Medicine & Forensics, Xi'an Jiaotong University, Xi'an, China;

Statistical Methods in Mapping Complex Diseases

Association studies and regression

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Effect of Genetic Divergence in Identifying Ancestral Origin using HAPAA

Introduction to Linkage Disequilibrium

p(d g A,g B )p(g B ), g B

1. Understand the methods for analyzing population structure in genomes

USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA. By Xiaoquan Wen and Matthew Stephens University of Chicago

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

Nature Genetics: doi: /ng Supplementary Figure 1. Imputation server overview.

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

Common Variants near MBNL1 and NKX2-5 are Associated with Infantile Hypertrophic Pyloric Stenosis

Modeling IBD for Pairs of Relatives. Biostatistics 666 Lecture 17

MCMC IN THE ANALYSIS OF GENETIC DATA ON PEDIGREES

Statistical issues in QTL mapping in mice

SNP Association Studies with Case-Parent Trios

Use of hidden Markov models for QTL mapping

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Lecture WS Evolutionary Genetics Part I 1

2. Map genetic distance between markers

Lecture 13: Population Structure. October 8, 2012

Multiple QTL mapping

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs.

Inferring Genetic Architecture of Complex Biological Processes

Test for interactions between a genetic marker set and environment in generalized linear models Supplementary Materials

Supplementary Materials: Meta-analysis of Quantitative Pleiotropic Traits at Gene Level with Multivariate Functional Linear Models

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

Bayesian Inference of Interactions and Associations

Supplementary Information for: Detection and interpretation of shared genetic influences on 42 human traits

New imputation strategies optimized for crop plants: FILLIN (Fast, Inbred Line Library ImputatioN) FSFHap (Full Sib Family Haplotype)

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Learning ancestral genetic processes using nonparametric Bayesian models

The genomes of recombinant inbred lines

Statistical Analysis of Haplotypes, Untyped SNPs, and CNVs in Genome-Wide Association Studies

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Analyzing metabolomics data for association with genotypes using two-component Gaussian mixture distributions

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15.

Introduction to QTL mapping in model organisms

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)

Improved linear mixed models for genome-wide association studies

I Have the Power in QTL linkage: single and multilocus analysis

UNIT 8 BIOLOGY: Meiosis and Heredity Page 148

Lecture 11: Multiple trait models for QTL analysis

BTRY 7210: Topics in Quantitative Genomics and Genetics

Introduction to QTL mapping in model organisms

Analytic power calculation for QTL linkage analysis of small pedigrees

Prediction of the Confidence Interval of Quantitative Trait Loci Location

Haplotyping. Biostatistics 666

Introduction to QTL mapping in model organisms

A mixed model based QTL / AM analysis of interactions (G by G, G by E, G by treatment) for plant breeding

Evolution of phenotypic traits

Linkage analysis and QTL mapping in autotetraploid species. Christine Hackett Biomathematics and Statistics Scotland Dundee DD2 5DA

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

Genetic Association Studies in the Presence of Population Structure and Admixture

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

The Admixture Model in Linkage Analysis

Methods for Cryptic Structure. Methods for Cryptic Structure

Pedigree and genomic evaluation of pigs using a terminal cross model

Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Backward Genotype-Trait Association. in Case-Control Designs

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to

Computational Approaches to Statistical Genetics

Efficient Bayesian mixed model analysis increases association power in large cohorts

TOPICS IN STATISTICAL METHODS FOR HUMAN GENE MAPPING

The universal validity of the possible triangle constraint for Affected-Sib-Pairs

Lecture 6. QTL Mapping

SNP-SNP Interactions in Case-Parent Trios

Theoretical Population Biology

Accounting for read depth in the analysis of genotyping-by-sequencing data

Mapping QTL to a phylogenetic tree

Lecture 28: BLUP and Genomic Selection. Bruce Walsh lecture notes Synbreed course version 11 July 2013

AFFECTED RELATIVE PAIR LINKAGE STATISTICS THAT MODEL RELATIONSHIP UNCERTAINTY

An Integrated Approach for the Assessment of Chromosomal Abnormalities

ACGTTTGACTGAGGAGTTTACGGGAGCAAAGCGGCGTCATTGCTATTCGTATCTGTTTAG Human Population Genomics

Quantitative Genetics & Evolutionary Genetics

For 5% confidence χ 2 with 1 degree of freedom should exceed 3.841, so there is clear evidence for disequilibrium between S and M.

Linkage and Linkage Disequilibrium

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Gene mapping, linkage analysis and computational challenges. Konstantin Strauch

Inference on pedigree structure from genome screen data. Running title: Inference on pedigree structure. Mary Sara McPeek. The University of Chicago

NIH Public Access Author Manuscript Genet Epidemiol. Author manuscript; available in PMC 2014 December 01.

Lecture 12 April 25, 2018

Transcription:

Genotype Imputation Biostatistics 666

Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives share long stretches of chromosome Sharing at some markers can be used as surrogate for sharing at unobserved markers

Today Genotype Imputation / In Silico Genotyping Use genotypes at a few markers to infer genotypes at other unobserved markers Closely related individuals Long segments of identity by descent Distantly related individuals Shorter segments of identity by descent

Intuition 2 2 2 2 2 1 2 1 2 2 2 2 2 1 2 1 1 2 1 2 2 2 2 2?? 1 1 2 1 2 1 1 1 1 1 1 2 1 2 2 1 2 1 1 2 1 2 1 1 1 1 Given the above pedigree, what are the likely values of the genotype marked?/??

In Silico Genotyping For Family Samples Family members will share large segments of chromosomes If we genotype many related individuals, we will effectively be genotyping a few chromosomes many times In fact, we can: Genotype a few markers on all individuals Identify shared segments of haplotypes Genotype additional markers on a subset of individuals Fill in missing genotypes that fall in shared segments Even without information on shared segments, it may be possible to learn about genotypes of relative members

Genotype Inference Part 1 Observed Genotype Data A/A A/G A/A A/G A/T A/A A/T G/T G/G G/T G/G G/T G/T G/T A/G A/A G/G A/A T/G C/G G/G C/C G/G A/G A/T G/T A/A C/G A/G A/T G/T G/T G/A C/G A/G G/T C/G A/G G/G A/A G/G C/C A/A G/G C/C G/G G/G G/G G/G

Genotype Inference Part 2 Inferring Allele Sharing A A G A A A G A T A A A A T T T T T T G G G T G G G T G G T T G A G A A G G A A T T T T T T T G C G G G C C G G A G A G T A A T T T G T G T G T A A G A T T T T C G C G A G A G A A A A G G G G........................ G T T T G G G G T T T T........................ C G G G C C C C G G G G

Genotype Inference Part 3 Imputing Missing Genotypes A/A A/G A/A A/G A/T A/A A/T G/T G/G G/T G/G G/T G/T G/G A/G A/A G/G A/A T/G C/G G/G C/C G/G A/G A/T G/T A/A C/G A/G A/T G/T G/T G/A C/G A/G G/T C/G A/G A/A G/G A/A A/T G/T G/G A/G C/C A/A A/T G/T G/G A/G C/C G/G A/T A/A G/G G/G A/T A/A G/G

Genotype Imputation in Families Suppose a particular genotype g ij is missing Genotype for person i at marker j Consider full set of observed genotypes G Evaluate pedigree likelihood L for each combination of {G, g ij = x} Posterior probability that g ij = x is P g ij = x G = L(G, g ij = x) L(G) For pairs, same HMM as for linkage analysis or checking relatedness. Large pedigrees, Lander-Green (1987) or Elston-Stewart (1972) algorithm.

Standard Linear Model for Genetic Association Model association using a model such as: E y i = μ + β g g i + β c c i + y i is the phenotype for individual i g i is the genotype for individual i Simplest coding is to set g i = number of copies of the first allele c i is a covariate for individual i Covariates could be estimated ancestry, environmental factors β coefficients are estimated covariate, genotype effects Model is fitted in variance component framework

Model With Inferred Genotypes Replace genotype score g with its expected value: E y i = μ + β g g ҧ + β c c + Where ഥg i = 2P g i = 2 G + P(g i = 1 G) Association test can then be implemented in variance component framework, just as before Alternatives would be to (a) impute genotypes with large posterior probabilities; or (b) integrate joint distribution of unobserved genotypes in family

Example I 1/1 1/1 1/1 1/1 1/1 L(G) =.0061 L(G, g 22 = 1/1) =.00494 L(G, g 22 = 1/2) =.00110 L(G, g 22 = 2/2) =.00006 Assumptions: Two alleles per marker Equal allele frequencies Θ = 0 P(g 22 = 1/1 G) = 0.81 P(g 22 = 1/2 G) = 0.18 P(g 22 = 2/2 G) = 0.01 g ҧ = 1.80

Example II 1/1 1/1 1/1 2/2 2/2 L(G) =.000244 L(G, g 22 = 1/1) =.000061 L(G, g 22 = 1/2) =.000122 L(G, g 22 = 2/2) =.000061 Assumptions: Two alleles per marker Equal allele frequencies Θ = 0 P(g 22 = 1/1 G) = 0.25 P(g 22 = 1/2 G) = 0.50 P(g 22 = 2/2 G) = 0.25 g ҧ = 1.00

Example III 1/1 1/1 1/1 1/1 1/1 L(G) =.0054 L(G, g 22 = 1/1) =.00392 L(G, g 22 = 1/2) =.00136 L(G, g 22 = 2/2) =.00012 Assumptions: Two alleles per marker Equal allele frequencies Θ = 0.10 P(g 22 = 1/1 G) = 0.73 P(g 22 = 1/2 G) = 0.25 P(g 22 = 2/2 G) = 0.02 g ҧ = 1.70

Example IV 1/1 1/1 1/1 2/2 2/2 L(G) =.000121 L(G, g 22 = 1/1) =.000033 L(G, g 22 = 1/2) =.000061 L(G, g 22 = 2/2) =.000028 Assumptions: Two alleles per marker Equal allele frequencies Θ = 0.10 P(g 22 = 1/1 G) = 0.273 P(g 22 = 1/2 G) = 0.499 P(g 22 = 2/2 G) = 0.227 g ҧ = 1.05

Power in Sibships of Size 6 Without Parental Genotype Data Analyze Observed Data Impute when Posterior >.99 Using Expected Genotype Score T is the number of genotyped offspring. QTL explains 5% of variance, polygenes explain 35%, 250 sibships, α = 0.001.

Application: Gene Expression Data Cheung et al (2005) carried out a genome wide association with 27 expression levels as traits Measured in grandparents and parents of CEPH pedigrees and took advantage of HapMap I genotypes SNP consortium genotypes also available for ~6000 SNPs in the offspring of each CEPH family

Example: Gene Expression Data Panels show GWA scan with CTBP1 expression as outcome Gene is at start of chromosome 4 Using observed genotypes, most significant association maps in cis for 15/27 traits 12 of these reach p < 5 * 10-8 Using inferred genotypes, most significant association maps in cis for 19/27 traits 15 of these reach p < 5 * 10-8 Data from Cheung et al. (2005)

Point of Situation When analyzing family samples FOR INDIVIDUALS WITH KNOWN RELATIONSHIPS Impute genotypes in relatives Imputation works through long shared stretches of chromosome But the majority of GWAS that use unrelated individuals FOR INDIVIDUALS WITH UNKNOWN RELATIONSHIPS Impute observed genotypes in relatives Imputation works through short shared stretches of chromosome

In Silico Genotyping For Unrelated Individuals In families, long stretches of shared chromosome In unrelated individuals, shared stretches are much shorter The plan is still to identify stretches of shared chromosome between individuals we then infer intervening genotypes by contrasting samples typing at a few sites with those with denser genotypes

Observed Genotypes Observed Genotypes.... A....... A.... A....... G....... C.... A... Reference Haplotypes C G A G A T C T C C T T C T T C T G T G C C G A G A T C T C C C G A C C T C A T G G C C A A G C T C T T T T C T T C T G T G C C G A A G C T C T T T T C T T C T G T G C C G A G A C T C T C C G A C C T T A T G C T G G G A T C T C C C G A C C T C A T G G C G A G A T C T C C C G A C C T T G T G C C G A G A C T C T T T T C T T T T G T A C C G A G A C T C T C C G A C C T C G T G C C G A A G C T C T T T T C T T C T G T G C Study Sample HapMap

Identify Match Among Reference Observed Genotypes.... A....... A.... A....... G....... C.... A... Reference Haplotypes C G A G A T C T C C T T C T T C T G T G C C G A G A T C T C C C G A C C T C A T G G C C A A G C T C T T T T C T T C T G T G C C G A A G C T C T T T T C T T C T G T G C C G A G A C T C T C C G A C C T T A T G C T G G G A T C T C C C G A C C T C A T G G C G A G A T C T C C C G A C C T T G T G C C G A G A C T C T T T T C T T T T G T A C C G A G A C T C T C C G A C C T C G T G C C G A A G C T C T T T T C T T C T G T G C

Phase Chromosome, Impute Missing Genotypes Observed Genotypes c g a g A t c t c c c g A c c t c A t g g c g a a G c t c t t t t C t t t c A t g g Reference Haplotypes C G A G A T C T C C T T C T T C T G T G C C G A G A T C T C C C G A C C T C A T G G C C A A G C T C T T T T C T T C T G T G C C G A A G C T C T T T T C T T C T G T G C C G A G A C T C T C C G A C C T T A T G C T G G G A T C T C C C G A C C T C A T G G C G A G A T C T C C C G A C C T T G T G C C G A G A C T C T T T T C T T T T G T A C C G A G A C T C T C C G A C C T C G T G C C G A A G C T C T T T T C T T C T G T G C

Implementation Markov model is used to model each haplotype, conditional on all others At each position, we assume that the haplotype being modeled copies a template haplotype Each individual has two haplotypes, and therefore copies two template haplotypes

Markov Model X X X 1 2 3 X M P X 1 S ) P X 2 S ) P X 3 S ) P X M S ) ( 1 ( 2 ( 3 ( M S 1 S S 2 3 SM P( S 1 ) P S 2 S ) P S 3 S ) P(...) ( 1 ( 2 The final ingredient connects template states along the chromosome

Possible States A state S selects pair of template haplotypes Consider S i as vector with two elements (S i,1, S i,2 ) With H possible haplotypes, H 2 possible states H(H+1)/2 of these are distinct A recombination rate parameter describes probability of switches between states P((S i,1 = a,s i,2 = b) (S i+1,1 = a,s i+1,2 = b)) (1-θ) 2 P((S i,1 = a,s i,2 = b) (S i+1,1 = a*,s i+1,2 = b)) (1-θ)θ/H P((S i,1 = a,s i,2 = b) (S i+1,1 = a*,s i+1,2 = b*)) (θ/h) 2

Emission Probabilities Each value of S implies expected pair of alleles Emission probabilities will be higher when observed genotype matches expected alleles Emission probabilities will be lower when alleles mismatch Let T(S) be a function that provides expected allele pairs for each state S

Emission Probabilities

Does This Really Work? Preliminary Results Used 11 tag SNPs to predict 84 SNPs in CFH Comparison of Test Statistics, Truth vs. Imputed Predicted genotypes differ from original ~1.8% of the time Reasonably similar results possible using various haplotyping methods Experimental Data 0 50 100 150 200 Chi-Square Test Statistic for Disease-Marker Association 0 50 100 150 200 Imputed Data

Does This Really Work? Used about ~300,000 SNPs from Illumina HumanHap300 to impute 2.1M HapMap SNPs in 2500 individuals from a study of type II diabetes Compared imputed genotypes with actual experimental genotypes in a candidate region on chromosome 14 1190 individuals, 521 markers not on Illumina chip Results of comparison Average r 2 with true genotypes 0.92 (median 0.97) 1.4% of imputed alleles mismatch original 2.8% of imputed genotypes mismatch Most errors concentrated on worst 3% of SNPs Scott et al, Science, 2007

Does this really, really work? 90 GAIN psoriasis study samples were re-genotyped for 906,600 SNPs using the Affymetrix 6.0 chip. Comparison of 15,844,334 genotypes for 218,039 SNPs that overlap between the Perlegen and Affymetrix chips resulted in discrepancy rate of 0.25% per genotype (0.12% per allele). Comparison of 57,747,244 imputed and experimentally derived genotypes for 661,881 non-perlegen SNPs present in the Affymetrix 6.0 array resulted in a discrepancy rate of 1.80% per genotype (0.91% per allele). Overall, the average r 2 between imputed genotypes and their experimental counterparts was 0.93. This statistic exceeded 0.80 for >90% of SNPs. Nair et al, Nature Genetics, 2009

LDLR and LDL example Willer et al, Nature Genetics, 2008 Li et al, Annual Review of Genomics and Human Genetics, 2009

Impact of HapMap Imputation on Power Power Disease SNP MAF tagsnps Imputation 2.5% 24.4% 56.2% 5% 55.8% 73.8% 10% 77.4% 87.2% 20% 85.6% 92.0% 50% 93.0% 96.0% Power for Simulated Case Control Studies. Simulations Ensure Equal Power for Directly Genotyped SNPs. Simulated studies used a tag SNP panel that captures 80% of common variants with pairwise r 2 > 0.80.

Combined Lipid Scans SardiNIA (Schlessinger, Uda, et al.) ~4,300 individuals, cohort study FUSION (Mohlke, Boehnke, Collins, et al.) ~2,500 individuals, case-control study of type 2 diabetes DGI (Kathiresan, Altshuler, Orho-Mellander, et al.) ~3,000 individuals, case-control study of type 2 diabetes Individually, 1-3 hits/scan, mostly known loci Analysis: Impute genotypes so that all scans are analyzed at the same SNPs Carry out meta-analysis of results across scans Willer et al, Nature Genetics, 2008

Combined Lipid Scan Results 18 clear loci! Willer et al, Nature Genetics, 2008

New LDL Locus, Previously Associated with CAD

Comparison with Related Traits: Coronary Artery Disease and LDL-C Alleles Gene LDL-C p-value Frequency CAD cases Frequency CAD ctrls CAD p-value APOE/C1/C4 3.0x10-43.209.184 1.0x10-4 1.17 (1.08-1.28) APOE/C1/C4 1.2x10-9.339.319.0068 1.10 (1.02-1.18) SORT1 6.1x10-33.808.778 1.3x10-5 1.20 (1.10-1.31) LDLR 4.2x10-26.902.890 6.7x10-4 1.29 (1.10-1.52) APOB 5.6x10-22.830.824.18 1.04 (0.95-1.14) APOB 8.3x10-12.353.332.0042 1.10 (1.03-1.18) APOB 3.1x10-9.536.520.028 1.07 (1.00-1.14) PCSK9 3.5x10-11.825.807.0042 1.13 (1.03-1.23) NCAN/CILP2 2.7x10-9.922.915.055 1.11 (0.98-1.26) B3GALT4 5.1x10-8.399.385.039 1.07 (0.99-1.14) B4GALT4 1.0x10-6.874.865.051 1.09 (0.98-1.20) Comparison to data from WTCCC (Nature, 2007) was made possible by imputation. OR

Does This Work Across Populations? Conrad et al. (2006) dataset Tag SNP Portability 52 regions, each ~330 kb Human Genome Diversity Panel ~927 individuals, 52 populations 1864 SNPs Grid of 872 SNPs used as tags Predicted genotypes for the other 992 SNPs Compared predictions to actual genotypes

(Evaluation Using ~1 SNP per 10kb in 52 x 300kb regions For Imputation)

Summary Genotype imputation can be used to accurately estimate missing genotypes Genotype imputation is usually implemented through using a Hidden Markov Model Benefits of genotype imputation Increases power of genetic association studies Facilitates analyses that combine data across studies Facilitates interpretation of results

2017 Imputation Accuracy: Europeans (Complete Genomics as Truth)

Imputation Servers https://imputationserver.sph.umich.edu

Recommended Reading Chen and Abecasis (2007) Family based association tests for genome wide association scans. Am J Hum Genet 81:913-926 Li et al (2010) Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology 34:816-834