Haplotype-based variant detection from short-read sequencing

Erik Garrison and Gabor Marth
July 16, 2012

1 Motivation

While statistical phasing approaches are necessary for the determination of large-scale haplotype structure [Browning and Browning, 2007, Delaneau et al., 2012, Howie et al., 2011, Li et al., 2010], sequencing traces provide short-range phasing information that may be employed directly in primary variant detection to establish phase between proximal alleles. Present read lengths and error rates limit this physical phasing approach to variants clustered within tens of bases, but as the cost of obtaining long sequencing traces decreases [Branton et al., 2008, Clarke et al., 2009], physical phasing methods will enable the determination of larger haplotype structure directly using only sequence information from a single sample.

The direct determination of haplotype structure from physical phasing information may provide a number of benefits for existing variant detection and genotyping approaches. Physical phasing can lower the computational burden of genotype imputation methods by reducing the possible space of haplotypes which must be considered. Similarly, physical phasing can be used to improve genotyping accuracy in the context of rare variations that can be difficult to impute due to sparse linkage information. Physical phasing methods can also improve allele detection accuracy, and ensure correct allele detection and genotyping in the context of overlapping and clustered alleles.

Haplotype-based detection methods require some form of assembly or homogenization of the alignments on which they operate, and thus these methods can also be used to handle characteristic sources of error generated by sequence misalignment. Provided that sequencing errors are relatively uncorrelated in comparison to true variations, the detection of longer alleles will improve the overall signal-to-noise ratio of variant calls.
The space of possible alleles, and thus erroneous haplotypes, grows dramatically with the length of the haplotype, while the number of true alleles does not.

The direct detection of haplotypes from alignment data presents a number of challenges to existing variant detection methods. As the length of a haplotype increases, so does the number of possible alleles within the haplotype, and thus methods designed to detect genetic variation over haplotypes in a unified context must be able to model multiallelism. However, most polymorphism detection methods establish estimates of the likelihood of polymorphism at a given locus using statistical models which assume biallelism [Li, 2011, Marth et al., 1999] and uniform, typically diploid, copy number [DePristo et al., 2011]. Moreover, improper modeling of copy number impedes the accurate detection of small variants on sex chromosomes, in polyploid organisms, or in locations with known copy-number variations, where called alleles, genotypes, and likelihoods should reflect local copy number and global ploidy.

To enable the application of population-level inference methods to the detection of haplotypes, we generalize the Bayesian statistical method described by Marth et al. [1999] to allow multiallelic loci and non-uniform copy number across the samples under consideration. We have implemented this model in FreeBayes [Garrison, 2012, 2011].

2 Generalizing variant detection to multiallelic loci and non-uniform copy number

2.1 Definitions

At a given genetic locus we have $n$ samples, each of which has a copy number or multiplicity of $m_i$ within the locus. We denote the number of copies of the locus present within our set of samples as $M = \sum_{i=1}^{n} m_i$. Among these $M$ copies we have $k$ distinct alleles, $b_1, \ldots, b_k$, with allele frequencies $f_1, \ldots, f_k$. Each individual has an unphased genotype $G_i$ comprised of $k_i$ distinct alleles $b_{i_1}, \ldots, b_{i_{k_i}}$ and corresponding genotype allele frequencies $f_{i_1}, \ldots, f_{i_{k_i}}$, which may be equivalently expressed as a multiset of alleles $B_i : |B_i| = m_i$. For the purposes of our analysis, we assume that we cannot accurately discern phasing information outside of the haplotype detection window, so our $G_i$ are unordered and all $G_i$ containing equivalent alleles and frequencies are regarded as equivalent.

Assume a set of $s_i$ sequencing observations $r_{i_1}, \ldots, r_{i_{s_i}} = R_i$ over each sample in our set of $n$ samples such that there are $\sum_{i=1}^{n} s_i$ reads at the genetic locus under analysis. $Q_i$ denotes the mapping quality, or probability that the read $r_i$ is mis-mapped against the reference.

2.2 A Bayesian approach

To genotype the samples at a specific locus, we could simply apply a Bayesian statistic relating $P(G_i \mid R_i)$ to the likelihood of sequencing errors in our reads and the prior likelihood of specific genotypes. However, this maximum-likelihood approach limits our ability to incorporate statistical information about the population under analysis.
By including all relevant sequencing information from our population in the analysis, we improve our ability to differentiate unique variants from sequencing errors without requiring that the variation in the population under analysis has already been characterized sufficiently to be used to estimate genotype likelihoods.

Given a set of genotypes $G_1, \ldots, G_n$ and a set of observations $R_1, \ldots, R_n$ for all samples at the current genetic locus, we can use Bayes' theorem to formulate the probability of a specific combination of genotypes as such:

\[
P(G_1, \ldots, G_n \mid R_1, \ldots, R_n) = \frac{P(G_1, \ldots, G_n)\, P(R_1, \ldots, R_n \mid G_1, \ldots, G_n)}{P(R_1, \ldots, R_n)} \tag{1}
\]

\[
P(G_1, \ldots, G_n \mid R_1, \ldots, R_n) = \frac{P(G_1, \ldots, G_n) \prod_{i=1}^{n} P(R_i \mid G_i)}{\sum_{\forall G_1, \ldots, G_n} \left( P(G_1, \ldots, G_n) \prod_{i=1}^{n} P(R_i \mid G_i) \right)} \tag{2}
\]

Under this decomposition, $P(R_1, \ldots, R_n \mid G_1, \ldots, G_n) = \prod_{i=1}^{n} P(R_i \mid G_i)$ represents the likelihood that our observations match a given genotype combination (our data likelihood), and $P(G_1, \ldots, G_n)$ represents the prior likelihood of observing a specific genotype combination. We estimate the data likelihood as the joint probability that the observations for a specific individual support a given genotype. We use a neutral model of allele diffusion conditioned on an estimated mutation rate to estimate the prior probability of a given collection of genotypes.

Except for situations with small numbers of samples and alleles, we avoid the explicit evaluation of the posterior distribution as implied by (2), instead using a number of optimizations to make the algorithm tractable to apply to very large datasets (see section 3.3).

2.3 Estimating the probability of a sample genotype given sequencing observations, $P(R_i \mid G_i)$

Given a set of reads $R_i = r_{i_1}, \ldots, r_{i_{s_i}}$ of a sample at a given locus, we can extract a set of $k_i$ observed alleles $B'_i = b'_1, \ldots, b'_{k_i}$ which encapsulate the potential set of represented variants at the locus in the given sample, including erroneous observations. Each of these observed alleles $b'_l$ has a frequency $o'_l$ within the observations of the individual sample, such that $\sum_{j=1}^{k_i} o'_j = s_i$, and corresponds to a true allele $b_l$.

If we had perfect observations of a locus, $P(R_i \mid G_i)$ for any individual would approximate the probability of sampling observations $R_i$ out of $G_i$ with replacement. This probability is given by the multinomial distribution in $s_i$ over the probability $P(b_l)$ of obtaining each allele from the given genotype, which is $f_{i_j} / m_i$ for each allele frequency in the frequencies which define $G_i$, $f_{i_1}, \ldots, f_{i_{k_i}}$.

\[
P(R_i \mid G_i) \approx P(B'_i \mid G_i) = \frac{s_i!}{\prod_{j=1}^{k_i} o'_j!} \prod_{j=1}^{k_i} \left( \frac{f_{i_j}}{m_i} \right)^{o'_j} \tag{3}
\]

Our observations are not perfect, and thus we must account for the probability of errors in the reads. We can use the per-base quality scores provided by sequencing systems to develop the probability that an observed allele is drawn from an underlying true allele, $P(B'_i \mid R_i)$.
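Before read errors are taken into account, the multinomial sampling probability of equation (3) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `sampling_probability` and the dictionary representation of genotypes and observations are assumptions made here, not part of FreeBayes, and genotype allele frequencies are treated as integer copy counts that sum to $m_i$.

```python
from math import factorial


def sampling_probability(genotype_counts, observation_counts):
    """Multinomial probability of equation (3), assuming error-free reads.

    genotype_counts: dict allele -> copies of that allele in G_i (sums to m_i)
    observation_counts: dict allele -> number of supporting reads (sums to s_i)
    """
    m = sum(genotype_counts.values())
    s = sum(observation_counts.values())

    # With perfect observations, an allele outside the genotype cannot be drawn.
    for allele, count in observation_counts.items():
        if count > 0 and allele not in genotype_counts:
            return 0.0

    # Multinomial coefficient s_i! / prod_j o'_j!
    coefficient = factorial(s)
    for count in observation_counts.values():
        coefficient //= factorial(count)

    # prod_j (f_ij / m_i) ** o'_j over the alleles of the genotype
    probability = float(coefficient)
    for allele, copies in genotype_counts.items():
        probability *= (copies / m) ** observation_counts.get(allele, 0)
    return probability
```

For a heterozygous diploid genotype `{"A": 1, "C": 1}` and observations `{"A": 3, "C": 1}`, the function returns 0.25, that is $\binom{4}{3,1} (1/2)^4$.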
We assume a mapping between sequencing quality scores and allele qualities such that each observed allele $b'_l$ has a corresponding quality $q_l$ which approximates the probability that the observed allele is incorrect. Furthermore, we must sum $P(R_i \mid G_i)$ for all possible $R_i$ combinations $\forall (R_i \in G_i : |R_i| = k_i)$ drawn from our genotype to obtain the joint probability of $R_i$ given $G_i$, as each $\prod_{l=1}^{s_i} P(b'_l \mid b_l)$ only accounts for the marginal probability of a specific $R_i$ given $B'_i$. This extends $P(R_i \mid G_i)$ as follows:

\[
P(R_i \mid G_i) = \sum_{\forall (R_i \in G_i)} \left( \frac{s_i!}{\prod_{j=1}^{k_i} o'_j!} \prod_{j=1}^{k_i} \left( \frac{f_{i_j}}{m_i} \right)^{o'_j} \prod_{l=1}^{s_i} P(b'_l \mid b_l) \right) \tag{4}
\]

In summary, the probability of obtaining a set of reads given an underlying genotype is proportional to the probability of sampling the set of observations from the underlying genotype, scaled by the probability that our reads are correct. As $q_l$ approximates the probability that a specific $b'_l$ is incorrect, $P(b'_l \mid b_l) = 1 - q_l$ when $b'_l \in G_i$ and $P(b'_l \mid b_l) = q_l$ when $b'_l \notin G_i$.

2.4 Priors for unphased genotype combinations, $P(G_1, \ldots, G_n)$

2.4.1 Decomposition of prior probability of genotype combination

Let $G_1, \ldots, G_n$ denote the set of genotypes at the locus and $f_1, \ldots, f_k$ denote the set of allele frequencies which corresponds to these genotypes. We estimate the prior likelihood of observing a specific combination of genotypes within a given locus by decomposition into resolvable terms:

\[
P(G_1, \ldots, G_n) = P(G_1, \ldots, G_n \cap f_1, \ldots, f_k) \tag{5}
\]

The probability of a given genotype combination is equivalent to the intersection of that probability and the probability of the corresponding set of allele frequencies. This identity follows from the fact that the allele frequencies are derived from the set of genotypes and we always will have the same $f_1, \ldots, f_k$ for any equivalent $G_1, \ldots, G_n$.

Following Bayes' rule, this identity further decomposes to:

\[
P(G_1, \ldots, G_n \cap f_1, \ldots, f_k) = P(G_1, \ldots, G_n \mid f_1, \ldots, f_k)\, P(f_1, \ldots, f_k) \tag{6}
\]

We now can estimate the prior probability of $G_1, \ldots, G_n$ in terms of the genotype combination sampling probability, $P(G_1, \ldots, G_n \mid f_1, \ldots, f_k)$, and the probability of observing a given allele frequency in our population, $P(f_1, \ldots, f_k)$.

2.4.2 Genotype combination sampling probability $P(G_1, \ldots, G_n \mid f_1, \ldots, f_k)$

Under neutral selection and unbiased distribution of alleles across the population, the probability of obtaining a specific $G_1, \ldots, G_n$ given allele frequencies $f_1, \ldots, f_k$ is given by the inverse of the number of ways we could arrange our $k$ alleles into $M$ locations: $\binom{M}{f_1, \ldots, f_k}^{-1}$.

As we cannot accurately phase genotypes using direct short-read sequencing observations, we cannot assume the ability to establish the phasing of genotypes within our set of samples.
Thus we must scale this probability by the product of all possible multiset permutations of all genotypes, $\prod_{i=1}^{n} \binom{m_i}{f_{i_1}, \ldots, f_{i_{k_i}}}$, thus yielding:

\[
P(G_1, \ldots, G_n \mid f_1, \ldots, f_k) = \binom{M}{f_1, \ldots, f_k}^{-1} \prod_{i=1}^{n} \binom{m_i}{f_{i_1}, \ldots, f_{i_{k_i}}} = \frac{1}{M! \big/ \prod_{l=1}^{k} f_l!} \prod_{i=1}^{n} \frac{m_i!}{\prod_{j=1}^{k_i} f_{i_j}!} \tag{7}
\]

In the case of a fully diploid population, the product of all possible multiset permutations of all genotypes reduces to $2^h$, where $h$ is the number of heterozygous genotypes, simplifying (7) to:

\[
P(G_1, \ldots, G_n \mid f_1, \ldots, f_k) = 2^h \binom{M}{f_1, \ldots, f_k}^{-1} \tag{8}
\]
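The genotype combination sampling probability of equation (7) can be checked numerically with a short Python sketch. The names `multinomial` and `genotype_combination_probability`, and the encoding of each genotype as a tuple of allele labels, are assumptions made for illustration and are not taken from the FreeBayes source. Allele frequencies are treated as integer counts, as the multinomial coefficients in (7) require.

```python
from collections import Counter
from math import factorial


def multinomial(counts):
    """Multinomial coefficient (sum of counts)! / prod(count!)."""
    counts = list(counts)
    result = factorial(sum(counts))
    for count in counts:
        result //= factorial(count)
    return result


def genotype_combination_probability(genotypes):
    """Equation (7) for a list of unphased genotypes.

    genotypes: list of tuples of allele labels, one tuple per sample,
               e.g. [("A", "A"), ("A", "C")] for two diploid samples.
    """
    # Pooled allele counts f_1..f_k over all M copies of the locus
    pooled = Counter(allele for genotype in genotypes for allele in genotype)
    arrangements = multinomial(pooled.values())

    # Product over samples of the multiset permutations of each genotype
    permutations = 1
    for genotype in genotypes:
        permutations *= multinomial(Counter(genotype).values())

    return permutations / arrangements
```

For three diploid samples with genotypes AA, AC, and CC there is one heterozygote and $\binom{6}{3,3} = 20$ arrangements, so the function returns $2^1 / 20 = 0.1$, in agreement with the diploid form (8).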

This component of the prior is important in the context of resequencing, as genomes contain paralogous regions which can generate spurious signals of variation due to read mapping imprecision. In practice, loci with this characteristic error tend to have many individuals with a small fraction of reads supporting an alternate allele. In such cases, we expect both weak data likelihood for variant genotypes and a weak component for this prior to reduce the likelihood of calling a variant.

2.4.3 Derivation of $P(f_1, \ldots, f_k)$ by Ewens' sampling formula

Provided our sample size $n$ is small relative to the population which it samples, and the population is in equilibrium under mutation and genetic drift, the probability of observing a given set of allele frequencies at a locus is given by Ewens' sampling formula [Ewens, 1972], which relates the probability of a given set of allele frequencies to the average mutation rate of the locus.

The application of Ewens' formula to our context is straightforward. Let $a_f$ be the number of alleles among $b_1, \ldots, b_k$ whose allele frequency within our sample is $f$. We can thus transform our set of frequencies $f_1, \ldots, f_k$ into a set of non-negative frequency counts $a_1, \ldots, a_M : \sum_{f=1}^{M} f a_f = M$. As many $f_1, \ldots, f_k$ can map to the same $a_1, \ldots, a_M$, this transformation is not invertible, but it is unique from $a_1, \ldots, a_M$ to $f_1, \ldots, f_k$.

Having transformed a set of frequencies over alleles to a set of frequency counts over frequencies, we can now use Ewens' sampling formula to approximate $P(f_1, \ldots, f_k)$ given an average rate of polymorphism $\theta$ between copies of the locus within each sample:

\[
P(f_1, \ldots, f_k) = P(a_1, \ldots, a_M) = \frac{M!}{\theta \prod_{z=1}^{M-1} (\theta + z)} \prod_{j=1}^{M} \frac{\theta^{a_j}}{j^{a_j} a_j!} \tag{9}
\]

In the bi-allelic case in which two alleles are present at this locus in our population:

\[
P(a_1, \ldots, a_M) = \frac{M!}{\prod_{z=1}^{M-1} (\theta + z)} \frac{\theta}{f_1 f_2} \tag{10}
\]

In the monomorphic case, where only a single allele is represented at this locus in our population, substituting $a_M = 1$ into (9) gives:

\[
P(a_1, \ldots, a_M) = \frac{(M-1)!}{\prod_{z=1}^{M-1} (\theta + z)} \tag{11}
\]

As we see, $P(f_1, \ldots, f_k)$ approximates $1 - \theta$ for the monomorphic case, which is sensible as $\theta$ represents the expected mutation rate.

3 Direct detection of phase from short-read sequencing

This exact Bayesian model to detect multi-allelic polymorphic loci provides the foundation for the direct detection of longer, multi-base alleles from sequence trace alignments. Here we describe the implementation of a haplotype-based variant detection method and the manner in which it is different from existing SNP and indel detection methods.
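As a numerical aside on the allele-frequency prior of section 2.4.3, Ewens' sampling formula (9) can be evaluated directly. The sketch below is a plain Python transcription of that equation; the function name `ewens_probability` and the choice of passing allele counts as a list are assumptions made here for illustration, not FreeBayes code.

```python
from collections import Counter
from math import factorial


def ewens_probability(allele_counts, theta):
    """Ewens' sampling formula, equation (9).

    allele_counts: list of per-allele counts f_1..f_k, summing to M
    theta: average rate of polymorphism between copies of the locus
    """
    M = sum(allele_counts)

    # a[f] = number of alleles observed exactly f times (the frequency counts)
    a = Counter(allele_counts)

    # Denominator theta * prod_{z=1}^{M-1} (theta + z)
    denominator = theta
    for z in range(1, M):
        denominator *= theta + z

    probability = factorial(M) / denominator
    # Only frequencies with a_j > 0 contribute a factor different from 1
    for j, a_j in a.items():
        probability *= theta ** a_j / (j ** a_j * factorial(a_j))
    return probability
```

With $M = 2$ the only configurations are one allele seen twice and two alleles seen once each; the formula assigns them $1/(1+\theta)$ and $\theta/(1+\theta)$, which sum to one and agree with the monomorphic value being close to $1 - \theta$ for small $\theta$.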

Our method is reference-guided, using alignments of short-read sequence traces to a reference genome as input. Anchoring sequence surrounds dynamically determined windows containing likely polymorphic sequence. Any number of samples may be analyzed simultaneously. As the number of input samples increases, the detection method gains power to differentiate true, non-paralogous polymorphic loci from loci which appear to be polymorphic but are in fact artifacts of sequencing and alignment. Moreover, the exact nature of the Bayesian model ensures that rare variants are not suppressed in reported probability as the number of samples under analysis increases.

3.1 Parsing haplotypic allele observations from sequencing data

In order to establish a range of sequence in which multiple polymorphisms segregate in the population under analysis, it is necessary to first determine potentially polymorphic windows in order to bound the analysis. This determination is complicated by the fact that a strict windowing can inappropriately break haplotypic structures into multiple variant calls. We employ a dynamic windowing approach that is driven by the observation of multiple proximal reference-relative variations (SNPs and indels) in input alignments. Where these variations are separated by less than a configurable number of non-polymorphic bases, our method combines them into a single haplotype allele observation, $H_i$. The observational quality of these haplotype alleles is given as $\min(q_l\ \forall\ b'_i \in H_i,\ Q_i)$, or the minimum of the base qualities of the allele observations in the haplotypic allele observation from the sequence trace and the mapping quality.

3.2 Determining a window over which to assemble haplotype observations

At each position in the reference, we collect allele observations. To improve performance, we apply a set of input filters to exclude alleles from the analysis which are highly unlikely to be true.
These filters require a minimum number of alternate observations and a minimum sum of base qualities in a single sample in order to incorporate a putative allele and its observations into the analysis.

We then determine a haplotype length over which to genotype samples by a bounded iterative process. We first determine the allele passing the input filters which is longest relative to the reference. Then, we parse haplotype observations from all the alignments which fully overlap this window, finding the rightmost end of the longest haplotype allele which begins within the window. This rightmost position is used to update the haplotype window. We then repeat the process until the window is not overlapped by any haplotype observations. This method is guaranteed to converge given finite read length and the requirement that only reads which fully overlap the detection window are used in the analysis.

3.3 Detection and genotyping of local haplotypes

Once a window for analysis has been determined, we parse all fully-overlapping reads into haplotype observations which are anchored by the ends of the window. Given these sets of sequencing observations $r_{i_1}, \ldots, r_{i_{s_i}} = R_i$ and data likelihoods $P(R_i \mid G_i)$ for each sample and genotype combination, we then determine the probability of polymorphism at the locus given the Bayesian model described in section 2.

To establish a maximum a posteriori estimate of the genotype for each sample, we employ a convergent gradient ascent approach to the posterior probability distribution over the mutual genotyping across all samples under our Bayesian model. This process begins at the genotyping across all samples $G_1, \ldots, G_n$ where each sample's genotype is the maximum-likelihood genotype given the data likelihood $P(R_i \mid G_i)$:

\[
G_1, \ldots, G_n = \operatorname*{argmax}_{G_i} P(R_i \mid G_i) := \{ G_i \mid \forall\, G_j \in G_1, \ldots, G_n : P(R_i \mid G_i) \ge P(R_i \mid G_j) \} \tag{12}
\]

The posterior search then attempts to find a mutual genotyping in the local space of genotypings which has higher posterior probability under the model than this initial mutual genotyping. In practice, this step is done by searching through all genotypings in which a single sample has up to the $N$th best genotype when ranked by $P(R_i \mid G_i)$, where $N$ is a small number. This search starts with some set of genotypes $G_1, \ldots, G_n = \{G\}$ and attempts to find a genotyping $\{G\}'$ such that:

\[
P(\{G\}' \mid R_1, \ldots, R_n) > P(\{G\} \mid R_1, \ldots, R_n) \tag{13}
\]

$\{G\}'$ is then used as a basis for the next update step. This search iterates until convergence, but in practice is bounded at a fixed number of steps for performance reasons. As the quality of input data increases in coverage and confidence, this search will converge more quickly because the maximum-likelihood estimate will lie closer to the maximum a posteriori estimate under the model.

While this method of posterior search and normalization is inferior to posterior sampling methods such as Markov chain Monte Carlo in terms of accuracy, we employ it because it is highly performant and deterministic.
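The search described above can be illustrated with a minimal Python sketch. It is not the FreeBayes implementation: the function name `posterior_search`, the use of log probabilities, the steepest-ascent acceptance rule, and the caller-supplied `log_prior` function are all assumptions made for illustration. The score being maximized is the unnormalized log posterior, that is the logarithm of the numerator of equation (2).

```python
def posterior_search(log_likelihoods, log_prior, top_n=3, max_steps=10):
    """Greedy local search for a high-posterior joint genotyping.

    log_likelihoods: one dict per sample, genotype -> log P(R_i | G_i)
    log_prior: callable mapping a tuple of genotypes to log P(G_1..G_n)
    top_n: how many of each sample's best genotypes to consider
    max_steps: hard bound on the number of update steps
    """
    # Rank each sample's genotypes by data likelihood and keep the top N
    candidates = [
        sorted(ll, key=ll.get, reverse=True)[:top_n] for ll in log_likelihoods
    ]

    def score(genotyping):
        data = sum(ll[g] for ll, g in zip(log_likelihoods, genotyping))
        return log_prior(genotyping) + data

    # Start from the maximum-likelihood genotype of every sample, as in (12)
    current = tuple(ranked[0] for ranked in candidates)
    best = score(current)

    for _ in range(max_steps):
        improved = None
        # Try every genotyping that differs from the current one in one sample
        for i, ranked in enumerate(candidates):
            for genotype in ranked:
                if genotype == current[i]:
                    continue
                trial = current[:i] + (genotype,) + current[i + 1:]
                trial_score = score(trial)
                if trial_score > best:  # condition (13)
                    best, improved = trial_score, trial
        if improved is None:
            break  # no single-sample change improves the posterior
        current = improved

    return current, best
```

As a usage example, take two samples with likelihoods `{"AA": -0.1, "AC": -5.0, "CC": -20.0}` and `{"AA": -2.0, "AC": -1.0, "CC": -10.0}` and a toy prior that charges 3 log units per copy of the alternate allele C. The search starts at the maximum-likelihood genotyping (AA, AC) and moves to (AA, AA), because the weak evidence for the heterozygote does not outweigh the prior penalty.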
This method incorporates a basic form of genotype imputation into the detection method, which in practice improves the quality of raw genotypes produced in primary allele detection and genotyping relative to methods which only utilize a maximum-likelihood method to determine genotypes [Li, 2011]. Furthermore, this method allows for the determination of marginal genotype likelihoods via the marginalization of assigned genotypes for each sample over the posterior probability distribution.

References

D. Branton, D. W. Deamer, A. Marziali, H. Bayley, S. A. Benner, T. Butler, M. Di Ventra, S. Garaj, A. Hibbs, X. Huang, S. B. Jovanovich, P. S. Krstic, S. Lindsay, X. S. Ling, C. H. Mastrangelo, A. Meller, J. S. Oliver, Y. V. Pershin, J. M. Ramsey, R. Riehn, G. V. Soni, V. Tabard-Cossa, M. Wanunu, M. Wiggin, and J. A. Schloss. The potential and challenges of nanopore sequencing. Nat. Biotechnol., 26(10):1146-1153, Oct 2008.

S. R. Browning and B. L. Browning. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet., 81(5):1084-1097, Nov 2007.

J. Clarke, H. C. Wu, L. Jayasinghe, A. Patel, S. Reid, and H. Bayley. Continuous base identification for single-molecule nanopore DNA sequencing. Nat. Nanotechnol., 4(4):265-270, Apr 2009.

O. Delaneau, J. Marchini, and J. F. Zagury. A linear complexity phasing method for thousands of genomes. Nat. Methods, 9(2):179-181, Feb 2012.

M. A. DePristo, E. Banks, R. Poplin, K. V. Garimella, J. R. Maguire, C. Hartl, A. A. Philippakis, G. del Angel, M. A. Rivas, M. Hanna, A. McKenna, T. J. Fennell, A. M. Kernytsky, A. Y. Sivachenko, K. Cibulskis, S. B. Gabriel, D. Altshuler, and M. J. Daly. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet., 43(5):491-498, May 2011.

W. J. Ewens. The sampling theory of selectively neutral alleles. Theor. Popul. Biol., 3:87-112, Mar 1972.

E. Garrison. FreeBayes source repository. https://github.com/ekg/freebayes, 2012.

E. Garrison. FreeBayes. http://bioinformatics.bc.edu/marthlab/freebayes, 2011.

B. Howie, J. Marchini, and M. Stephens. Genotype imputation with thousands of genomes. G3 (Bethesda), 1(6):457-470, Nov 2011.

H. Li. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21):2987-2993, Nov 2011.

Y. Li, C. J. Willer, J. Ding, P. Scheet, and G. R. Abecasis. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol., 34(8):816-834, Dec 2010.

G. T. Marth, I. Korf, M. D. Yandell, R. T. Yeh, Z. Gu, H. Zakeri, N. O. Stitziel, L. Hillier, P. Y. Kwok, and W. R. Gish. A general approach to single-nucleotide polymorphism discovery. Nat. Genet., 23:452-456, Dec 1999.