Pyrobayes: an improved base caller for SNP discovery in pyrosequences

Aaron R Quinlan, Donald A Stewart, Michael P Strömberg & Gábor T Marth

Supplementary figures and text:
Supplementary Figure 1. The Pyrobayes base calling approach.
Supplementary Figure 2. Estimation of data likelihoods with parent distributions.
Supplementary Figure 3. Concordance of base errors in the Pyrobayes and the native 454 base calls.
Supplementary Figure 4. Distribution of base quality scores within homopolymer runs.
Supplementary Figure 5. Base quality accuracy for the 454 Life Sciences FLX model.
Supplementary Methods

SUPPLEMENTARY FIGURE 1. The PYROBAYES base calling approach. a, Data likelihoods: the distributions of measured nucleotide incorporation signal intensity values for various homopolymer lengths. b, Priors: the observed frequencies of homopolymer lengths in eight different organismal genome sequences. The theoretical expectation of exponential decay is also included. c, Bayesian posteriors: the posterior probabilities of homopolymer lengths 0 to 5, calculated from the data likelihoods (panel a) and the prior probabilities (panel b). d, The most likely number of bases for eight consecutive nucleotide tests. The base quality value assigned to each called base is the probability that the base in question was not an overcall.

SUPPLEMENTARY FIGURE 2. Estimation of data likelihoods with parent distributions. The observed distributions of the incorporation intensity signal for various homopolymer lengths are marked with black crosses: (a) Length = 1. (b) Length = 2. (c) Length = 3. (d) Length = 4. The approximating (non-central Student's t) distributions are indicated in red.

SUPPLEMENTARY FIGURE 3. Concordance of base errors in the PYROBAYES and the native 454 base calls. (a) Fraction of errors shared between the two methods and the fractions unique to each method. "Other errors" indicates cases where both methods made an error but not the same type of error. (b) Distribution, by error type, of errors made solely by PYROBAYES. (c) Distribution, by error type, of errors made solely by the native 454 base caller. (d) Distribution of "other errors", i.e. cases where both methods made an error but the error types were different.

SUPPLEMENTARY FIGURE 4. Distribution of base quality scores within homopolymer runs. (a) The average quality score assigned by PYROBAYES as a function of the base position within the homopolymer, for homopolymer runs of up to five bases. For example, the average base quality value assigned to the second base in a homopolymer run of four nucleotides is 43. (b) The directionality of quality values in homopolymer runs, shown for two reads aligned to a reference sequence (bold). The top read is aligned in forward (+) orientation and the bottom read in reverse (-) orientation. The quality values are shown beneath the homopolymer for each aligned read.

SUPPLEMENTARY FIGURE 5. Base quality accuracy for the 454 Life Sciences FLX model. Assigned base quality values (x axis) are compared to the quality values calculated from measured read accuracy (y axis). PYROBAYES base calls: red; native 454 base calls: gray.

SUPPLEMENTARY METHODS

Determination of base number probabilities. We determined how often, given an observed signal from a nucleotide test, the actual number of incorporated bases was 1, 2, 3, etc., by aligning 454 reads from a mouse BAC to the known BAC reference sequence. Because any observed mismatch between a 454 read and the BAC reference can be taken to be a sequencing error, the observed frequencies serve as estimates for the data likelihoods, $\Pr(s \mid n)$, in our Bayesian framework, where s is the observed nucleotide incorporation signal and n is the homopolymer length. The 661,481 454 reads (a total of 65,875,710 bases) were aligned to the BAC reference sequence with our novel reference-sequence-guided alignment and assembly software, MOSAIK (manuscript in preparation).

Estimating the prior homopolymer probabilities. According to a model of random aggregation of consecutive bases in a DNA sequence, the expected frequency of the four possible homopolymers of length n is proportional to $1/4^n$. To check the validity of this expectation, we computed the frequency of observed homopolymers in the genomes of eight species (Escherichia coli K12, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Oryza sativa, Arabidopsis thaliana, Mus musculus, and Homo sapiens). As Supplementary Figure 1b shows, this random expectation grossly underestimates the actual frequency of longer homopolymers in the genomes we analyzed. For the prior probability values, $\mathrm{Prior}(n)$, we used the average of the eukaryote homopolymer frequencies. The prior probability for a homopolymer of length zero (i.e. the prior probability that no base is incorporated in a flow) was tabulated by counting what fraction of the nucleotide tests do not correspond to an actual base in a single run of 454 reads from a mouse BAC shotgun library.

Determination of the most likely number of bases. Using the data likelihoods and the prior probabilities, we determined $\Pr(n \mid s)$, i.e. the base number probabilities, for every possible base number (up to a rational limit of n = 100), according to the following formula:

$$\Pr(n \mid s) = \frac{\Pr(s \mid n)\,\mathrm{Prior}(n)}{\sum_{\text{every possible } n_i} \Pr(s \mid n_i)\,\mathrm{Prior}(n_i)}$$

The number n for which this posterior probability is highest is the most likely number of bases.

Parent distribution fitting to estimate the frequency of longer homopolymer runs. Longer homopolymeric runs are inherently less frequent than short runs. Regardless of the amount of available testing data, the number of examples for long homopolymeric runs will always be too small for reliable frequency estimation. In our own test data set, there was not sufficient data to estimate the frequency of homopolymers longer than seven nucleotides. To both extend base calling to longer homopolymeric runs and improve runtime, we replaced the observed data likelihoods with appropriate parent distributions. For this purpose we used non-central Student's t distributions (Supplementary Figure 2). The non-central t fit parameters for shorter homopolymers were extrapolated to longer homopolymer distributions.

Base called sequence generation. The base-called sequence is produced by concatenating the most likely number of bases from each nucleotide test (adding zero bases if the most likely number is zero). Additionally, we call an extra base if its posterior probability is above 0.5.

Base quality assignment. The base quality value assigned to each base represents the probability that the base in question was, in fact, incorporated within the test. For example, if the most likely number of nucleotides for an observed signal is three, then the base quality value for the first nucleotide in the run of three reflects the probability that, based on the signal, at least one nucleotide was incorporated. Similarly, the second base reflects the probability that at least two nucleotides were incorporated, and so forth. Consequently, the bases with the highest assigned base qualities are the first bases in longer homopolymer runs (Supplementary Figure 1d, Supplementary Figure 4a). The highest base quality value that PYROBAYES assigns is 50, which represents an expected error rate of 1 in 100,000. Nominal base quality values above this value are reduced to 50.

Sequence alignment. To align next-generation sequencer reads to entire reference genome sequences efficiently and accurately, we developed a new alignment and assembly algorithm, MOSAIK. MOSAIK uses a hash-based approach (refs. 18-20) for a fast initial read placement, followed by an exhaustive local Smith-Waterman-Gotoh (ref. 21) pairwise alignment. Each read was placed at the location where it best aligned, as long as minimum alignment criteria were met (> 95% aligned length; > 95% sequence identity). To avoid misalignment due to paralogy, however, a read that could be mapped to another location with a Smith-Waterman alignment score of at least 80% of the best alignment score was rejected from the alignment altogether.

Overall sequence error rates. We aligned the Drosophila melanogaster iso-1 reads (GS20 model), base called by both PYROBAYES and the native 454 base caller, to the Drosophila reference genome sequence. Counting only reads that could be uniquely aligned with our stringent alignment criteria, over 20 Mb of sequence was aligned for each method. The sequence differences between every aligned read and the reference sequence were tabulated and used to compute the overall, insertion, deletion and substitution error rates for each method. Ns called by the 454 method were counted as substitutions, since PYROBAYES always calls a base in cases where the 454 method would call an N.

Measured base quality value calculation. Using the same Drosophila melanogaster iso-1 reads aligned to the reference genome, the measured (i.e. "actual") base quality for each assigned base quality was derived by tabulating the number of correct and incorrect base calls made with a given assigned base quality. We then computed an actual base quality for the assigned base quality, according to the following formula:

$$Q = -10 \log_{10}\left(1 - \frac{\#\text{correct}}{\text{total}}\right)$$

SNP calling and validation. SNPs were called in sequences from an inbred African isolate of Drosophila melanogaster that had been base called by both PYROBAYES and the native 454 base caller. The resulting sequences from each method were aligned to the Drosophila genome reference sequence. We then used the POLYBAYES SNP discovery program (ref. 12) to call SNP candidates among the aligned sequences from each base caller. POLYBAYES used the base quality values from the respective base caller to assign a SNP likelihood to all observed mismatches between the aligned read and the reference sequence. We submitted the SNPs identified from the sequences called by PYROBAYES to experimental validation by PCR-based capillary sequencing. We computed the validation rate from the fraction of SNPs that were identified from the PYROBAYES-called sequences but not confirmed by the capillary validation sequences.

Missed SNP rate calculation. We estimated the missed SNP rate using the capillary sequences produced for candidate SNP validation. First, we used the POLYPHRED program (ref. 13) to call SNPs in the capillary validation traces from the 46-2 isolate. Second, we manually inspected these SNPs and excluded obvious false positives. We treated the remaining POLYPHRED calls as true SNPs. Third, for each true SNP, we determined whether it was also covered by a base-called read from the same isolate. We counted cases where there was coverage by a base-called read but the SNP was not discovered by POLYBAYES as missed SNPs. We calculated the missed SNP rate as the number of missed SNPs divided by the number of true SNPs for which there was coverage by a base-called read.
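The base-calling and quality steps described in the Supplementary Methods can be illustrated with a minimal Python sketch. It is not the PYROBAYES implementation: the function names (`posteriors`, `phred`, `call_flow`, `measured_quality`) are hypothetical, the likelihoods and priors are passed in as plain lists rather than fitted non-central t distributions, and the conversion of a probability into a quality value assumes standard Phred scaling with rounding to an integer (consistent with a quality of 50 corresponding to an error rate of 1 in 100,000, but not stated explicitly above). The extra-base rule (posterior above 0.5) is omitted.

```python
import math

MAX_QUALITY = 50  # highest value assigned, per the text (error rate of 1 in 100,000)


def posteriors(likelihoods, priors):
    """Return Pr(n | s) for n = 0..N given Pr(s | n) and Prior(n) as lists."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(joint)
    if total <= 0.0:
        raise ValueError("signal has zero probability under every base number")
    return [j / total for j in joint]


def phred(error_prob, cap=MAX_QUALITY):
    """Phred-scale an error probability (assumed scaling), capped at `cap`."""
    if error_prob <= 0.0:
        return cap
    return min(cap, round(-10.0 * math.log10(error_prob)))


def call_flow(post):
    """Call one nucleotide test: (most likely base number, quality per called base)."""
    n_best = max(range(len(post)), key=lambda n: post[n])
    # Base k (1-based) is an overcall only if fewer than k bases were incorporated,
    # so its error probability is Pr(n < k | s) = sum of post[0..k-1].
    qualities = [phred(sum(post[:k])) for k in range(1, n_best + 1)]
    return n_best, qualities


def measured_quality(correct, total):
    """Measured quality Q = -10 log10(1 - correct/total) for one assigned-quality bin."""
    errors = total - correct
    if errors == 0:
        return math.inf  # no errors observed in this bin
    return -10.0 * math.log10(errors / total)
```

With toy inputs such as `call_flow(posteriors([0.01, 0.2, 0.7, 0.09], [0.5, 0.3, 0.15, 0.05]))`, the sketch calls two bases and gives the first base a higher quality than the second, which mirrors the pattern described for homopolymer runs (Supplementary Figure 4a).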