GBS Bioinformatics Pipeline(s) Overview

Getting from sequence files to genotypes.
Pipeline coding: Ed Buckler, Jeff Glaubitz, James Harriman
Presentation: Rob Elshire, with supporting information from the coders.

Three Pipelines
Discovery Pipeline
  Requires a reference genome
  Multiple steps to get to genotypes
  Hands-on tutorial is based on this pipeline
Production Pipeline
  Uses information from the Discovery Pipeline
  One step from sequence to genotypes
UNEAK Pipeline
  For species without a reference genome
  Fei Lu will present this tomorrow at 9:30

Vocabulary
Sequence File: Text file containing DNA sequence and supplemental information from the Illumina platform.
Key File: Text file used to assign a GBS Bar Code to a Taxa.
GBS Tag: DNA sequence consisting of a cut site remnant and additional sequence.
GBS Bar Code: A short, known sequence of DNA used to assign a GBS Tag to its original Taxa.
Taxa: An individual sample.

GBS Discovery Pipeline
[Flow diagram. Stages: Discovery Sequence, Tag Counts, TOPM, Tags by Taxa, SNP Caller]

Raw Sequence (Qseq)
HWI-ST397 0 3 68 15896 200039 0 1 GTCGATTCTGCTGACTTCATGGCTTCTGTTGACG
HWI-ST397 0 3 68 15960 200043 0 1 GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGT
HWI-ST397 0 3 68 15831 200053 0 1 ATGTACTGCACCGTTGCAAGCGAGCACCACCAA
HWI-ST397 0 3 68 15867 200049 0 1 CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAA
HWI-ST397 0 3 68 15943 200048 0 1 GATTTTACTGCACATCGGTCTTGTCACACCAGCT
HWI-ST397 0 3 68 15812 200062 0 1 TCACCCAGCATCACGCCCCTTCACATCCAGTAAA
HWI-ST397 0 3 68 15888 200067 0 1 CTTGACTGCCACCATGAATATGTGTTCCAAGTGC
HWI-ST397 0 3 68 15969 200067 0 1 CCACAACTGCTCCATCTTTTCCATGAGACATTGC
HWI-ST397 0 3 68 15786 200078 0 1 GTATTCTGCACACGAATCAGCTGAGACACCAATT
HWI-ST397 0 3 68 15830 200072 0 1 AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAG
HWI-ST397 0 3 68 15863 200073 0 1 CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTT
HWI-ST397 0 3 68 15762 200088 0 1 TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTT
HWI-ST397 0 3 68 15903 200085 0 1 GGACCTACTGCCCAAGAACGGCTCACCCATCAT
HWI-ST397 0 3 68 15921 200082 0 1 GAGAATCAGCGTGTACGGGGCACGGGGTGACT
HWI-ST397 0 3 68 15984 200085 0 1 TTCTCCAGCCGCATGGGCCGGAGACCAGAGAG
HWI-ST397 0 3 68 15788 200096 0 1 GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCA
HWI-ST397 0 3 68 15842 200099 0 1 TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAA
HWI-ST397 0 3 68 15876 200105 0 1 GGACCTACTGCCGGCGGGACGAAAGCGGTTGT
HWI-ST397 0 3 68 15937 200097 0 1 CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTC
HWI-ST397 0 3 68 15958 200102 0 1 CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTG
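
To make the column layout of the records above concrete, here is a minimal Python sketch that splits one qseq line into named fields. The field names (machine, run, lane, tile, x, y, index, read number, sequence) follow the usual qseq column order; `QseqRecord` and `parse_qseq_line` are hypothetical names and are not part of the GBS pipeline. A full qseq record normally also carries quality and filter columns, which the slide does not show, so the sketch only reads the first nine fields.

```python
from typing import NamedTuple


class QseqRecord(NamedTuple):
    """The nine qseq fields visible on the slide (hypothetical container)."""
    machine: str
    run: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    read_number: int
    sequence: str


def parse_qseq_line(line: str) -> QseqRecord:
    """Split one whitespace-separated qseq line into named fields."""
    fields = line.split()
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 fields, got {len(fields)}")
    machine, run, lane, tile, x, y, index, read_number, sequence = fields[:9]
    return QseqRecord(machine, run, int(lane), int(tile), int(x), int(y),
                      index, int(read_number), sequence)


record = parse_qseq_line(
    "HWI-ST397 0 3 68 15896 200039 0 1 GTCGATTCTGCTGACTTCATGGCTTCTGTTGACG"
)
print(record.lane, record.tile, record.sequence)
```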

Key File
Flowcell   Lane  Barcode  DNASample  LibraryPlate  Row  Column  LibraryPrepID  PlateName
81PVTABXX  2     CTCC     Sample_1   1             A    1       1              Plate_A
81PVTABXX  2     TGCA     Sample_2   1             A    2       2              Plate_A
81PVTABXX  2     ACTA     Sample_3   1             A    3       3              Plate_A
81PVTABXX  2     CAGA     Sample_4   1             A    4       4              Plate_A
81PVTABXX  2     AACT     Sample_5   1             A    5       5              Plate_A
81PVTABXX  2     GCGT     Sample_6   1             A    6       6              Plate_A
81PVTABXX  2     TGCGA    Sample_7   1             A    7       7              Plate_A
81PVTABXX  2     CGAT     Sample_8   1             A    8       8              Plate_A
81PVTABXX  2     CGCTT    Sample_9   1             A    9       9              Plate_A
81PVTABXX  2     TCACC    Sample_10  1             A    10      10             Plate_A
81PVTABXX  2     CTAGC    Sample_11  1             A    11      11             Plate_A
81PVTABXX  2     ACAAA    Sample_12  1             A    12      12             Plate_A
81PVTABXX  2     TTCTC    Sample_13  1             B    1       13             Plate_A
81PVTABXX  2     AGCCC    Sample_14  1             B    2       14             Plate_A
81PVTABXX  2     GTATT    Sample_15  1             B    3       15             Plate_A
81PVTABXX  2     CTGTA    Sample_16  1             B    4       16             Plate_A
81PVTABXX  2     ACCGT    Sample_17  1             B    5       17             Plate_A
81PVTABXX  2     GTAA     Sample_18  1             B    6       18             Plate_A
81PVTABXX  2     GGTTGT   Sample_19  1             B    7       19             Plate_A
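
As an illustration of how a key file maps barcodes to samples, the following Python sketch loads the rows for one flowcell and lane into a barcode-to-sample dictionary. It assumes a tab-delimited file with the column headers shown above; `load_key_file` is a hypothetical helper, not a function of the pipeline.

```python
import csv
import io


def load_key_file(handle, flowcell, lane):
    """Return {barcode: sample name} for one flowcell and lane.

    Assumes tab-delimited text whose header row uses the column names
    from the slide (Flowcell, Lane, Barcode, DNASample, ...).
    """
    barcode_to_sample = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Flowcell"] == flowcell and row["Lane"] == str(lane):
            barcode_to_sample[row["Barcode"]] = row["DNASample"]
    return barcode_to_sample


# Two rows copied from the key file above, used as an in-memory example.
example = io.StringIO(
    "Flowcell\tLane\tBarcode\tDNASample\tLibraryPlate\tRow\tColumn\tLibraryPrepID\tPlateName\n"
    "81PVTABXX\t2\tCTCC\tSample_1\t1\tA\t1\t1\tPlate_A\n"
    "81PVTABXX\t2\tTGCA\tSample_2\t1\tA\t2\t2\tPlate_A\n"
)
barcodes = load_key_file(example, "81PVTABXX", 2)
print(barcodes)
```

Note that the barcodes in the key file vary in length (4 to 6 bases in the rows shown), so any lookup against a read has to allow for that.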

GBS Tags
[Figure: read structures drawn from the elements Barcode adapter, Barcode, Cut site, Insert, 2nd Insert, and Common adapter]
Good read
Rejected reads: No Barcode, No Cut site, Adapter dimer
Trimmed reads: Chimeric (?) sequence, Short sequence
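
The read categories in the figure can be sketched as a small classifier. This is a simplified illustration under stated assumptions, not the pipeline's actual logic: the cut site remnants, full cut sites, and adapter prefix are passed in as parameters (the values used in the example are placeholders, and the adapter string is not a real adapter sequence), and trimmed reads are simply truncated at the first full cut site or adapter found.

```python
def classify_read(read, barcodes, remnants, full_cut_sites, adapter_prefix):
    """Return (status, barcode, tag) for one raw read (illustrative only)."""
    # Barcodes differ in length, so collect every barcode the read starts with.
    candidates = [b for b in barcodes if read.startswith(b)]
    if not candidates:
        return ("rejected: no barcode", None, None)
    # Keep barcodes that are immediately followed by a cut site remnant.
    matches = [b for b in candidates
               if any(read.startswith(r, len(b)) for r in remnants)]
    if not matches:
        return ("rejected: no cut site", None, None)
    barcode = max(matches, key=len)
    tag = read[len(barcode):]  # the tag starts at the cut site remnant
    remnant_len = next(len(r) for r in remnants if tag.startswith(r))
    if tag.startswith(adapter_prefix, remnant_len):
        return ("rejected: adapter dimer", barcode, None)
    # Truncate at the earliest second cut site (chimeric) or adapter (short).
    status, end = "good", len(tag)
    for site in full_cut_sites:
        pos = tag.find(site, 1)
        if pos != -1 and pos < end:
            status, end = "trimmed: chimeric", pos
    pos = tag.find(adapter_prefix, remnant_len)
    if pos != -1 and pos < end:
        status, end = "trimmed: short", pos
    return (status, barcode, tag[:end])


# Placeholder settings for the example; none of these come from the slides.
BARCODES = {"CTCC", "TGCA"}
REMNANTS = ("CAGC", "CTGC")
FULL_SITES = ("GCAGC", "GCTGC")
ADAPTER = "TTTTTTTT"  # stand-in string, not a real adapter sequence

print(classify_read("CTCCCAGCAAAACCCC", BARCODES, REMNANTS, FULL_SITES, ADAPTER))
```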

Tag Counts
Using information from the key file, each sequence file is processed, and its tags are identified and counted. A tag shorter than 64 bases is padded to 64 bases. The tags and their counts are written to one tag count file per sequence file.
Plugins: QseqToTagCountsPlugin / FastqToTagCountsPlugin
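
A minimal sketch of the counting step, assuming tags have already been extracted from the reads. The 64-base length comes from the slide; the padding character ("A") and the names `normalize_tag` and `count_tags` are assumptions made for illustration.

```python
from collections import Counter

TAG_LENGTH = 64  # fixed tag length stated on the slide


def normalize_tag(sequence, length=TAG_LENGTH, pad="A"):
    """Truncate to `length` bases, or pad shorter tags (pad base is assumed)."""
    return sequence[:length].ljust(length, pad)


def count_tags(sequences):
    """Count identical tags after normalizing them to a fixed length."""
    return Counter(normalize_tag(seq) for seq in sequences)


counts = count_tags(["CAGCAAAACCCC", "CAGCAAAACCCC", "CTGCGGGG"])
for tag, n in counts.items():
    print(n, tag)
```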

Master Tag Counts
The individual tag count files are merged into a master tag count file. A minimum count is specified at the merge stage to exclude tags with low counts (likely sequencing errors).
Plugin: MergeMultipleTagCountsPlugin
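
The merge with a minimum-count cutoff can be sketched as follows. This is an in-memory illustration with a hypothetical `merge_tag_counts` function; the real plugin works on tag count files.

```python
from collections import Counter


def merge_tag_counts(per_file_counts, min_count):
    """Sum tag counts across files, then drop tags below `min_count`."""
    master = Counter()
    for counts in per_file_counts:
        master.update(counts)  # adds counts for tags seen in several files
    return {tag: n for tag, n in master.items() if n >= min_count}


lane_1 = {"CAGCAAAA": 4, "CTGCGGGG": 1}
lane_2 = {"CAGCAAAA": 3, "CAGCTTTT": 1}
master = merge_tag_counts([lane_1, lane_2], min_count=2)
print(master)
```

The cutoff is applied to the merged total, so a tag that is rare in each file but recurs across files can still be kept.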

Conversion of Tags to Fastq
Sequence aligners do not work with the tag count file format. In preparation for the alignment step, the tag count file is converted to fastq format.
Plugin: TagCountsToFastqPlugin
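
To show what such a conversion involves, here is a sketch that writes each tag as a four-line fastq record. Tags carry no base qualities, so a constant placeholder quality is used; the record naming scheme and the quality character are assumptions, not the plugin's actual output format.

```python
def tags_to_fastq(tag_counts, placeholder_quality="I"):
    """Render {tag: count} as fastq text with one record per tag."""
    lines = []
    for i, (tag, count) in enumerate(sorted(tag_counts.items())):
        lines.append(f"@tag{i}_count={count}")       # hypothetical record name
        lines.append(tag)                            # the tag sequence
        lines.append("+")                            # fastq separator line
        lines.append(placeholder_quality * len(tag))  # one quality char per base
    return "\n".join(lines) + "\n" if lines else ""


print(tags_to_fastq({"CAGCAAAA": 7, "CTGCGGGG": 2}), end="")
```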

Tag Alignment / TOPM
The GBS pipeline uses an external aligner for the initial alignment. The current version uses bowtie2, which produces the alignment in SAM format. The SAM file is then converted into our Tags On Physical Map (TOPM) format.
Plugin: SAMConverterPlugin
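
The information a tag-to-position map needs from each alignment can be read from the mandatory SAM columns. The sketch below uses only standard SAM fields (QNAME, FLAG, RNAME, POS) and flag bits (0x4 unmapped, 0x10 reverse strand); the returned dictionary is a stand-in for illustration and is not the TOPM file format.

```python
def parse_sam_alignment(line):
    """Extract tag name, chromosome, strand, and position from a SAM line.

    Returns None for header lines. POS is the 1-based leftmost mapped
    position, as defined by the SAM format.
    """
    if line.startswith("@"):
        return None  # header line, no alignment
    fields = line.rstrip("\n").split("\t")
    name, flag, chrom, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
    if flag & 0x4:  # read (here: tag) did not align
        return {"tag": name, "chrom": None, "strand": None, "pos": None}
    strand = "-" if flag & 0x10 else "+"
    return {"tag": name, "chrom": chrom, "strand": strand, "pos": pos}


example = "tag0_count=7\t16\t1\t2100\t42\t64M\t*\t0\t0\tACGT\tIIII"
print(parse_sam_alignment(example))
```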

TOPM

So Far We Have
  Identified and counted GBS tags.
  Converted the tag counts file to fastq.
  Aligned the tags to a reference.
  Converted the alignment to TOPM.

Tags by Taxa
In this step we identify which tags are present in which taxa.
Inputs:
  Original sequence files
  Key file
  Master tag count file
Recently migrated to the HDF5 file format:
  Efficient storage
  Large data sets
Plugin: SeqToTBTHDF5Plugin
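
Conceptually, a Tags by Taxa (TBT) table records, for each taxon, how often each master tag was observed. The sketch below builds such a table in memory, assuming the reads have already been assigned to samples through the key file and normalized to tags; `build_tbt` is a hypothetical name, and the real plugin stores the table in HDF5 rather than in dictionaries.

```python
from collections import Counter


def build_tbt(tags_by_sample, master_tags):
    """Return {sample: {tag: count}} restricted to tags in the master list."""
    tbt = {}
    for sample, tags in tags_by_sample.items():
        counts = Counter(tag for tag in tags if tag in master_tags)
        tbt[sample] = dict(counts)
    return tbt


master_tags = {"CAGCAAAA", "CTGCGGGG"}
tags_by_sample = {
    "Sample_1": ["CAGCAAAA", "CAGCAAAA", "CAGCTTTT"],  # last tag not in master
    "Sample_2": ["CTGCGGGG"],
}
print(build_tbt(tags_by_sample, master_tags))
```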

Tags By Taxa Additional Operations
  If many TBTs have been created, they are merged into one TBT.
  Taxa that were sequenced multiple times are merged.
  The TBT table is pivoted in preparation for SNP calling.
Plugin: ModifyTBTHDF5Plugin
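
The merge and pivot operations can be illustrated on a small in-memory table. This is a sketch under assumptions: the table is a nested dictionary, and the rule for deciding that two samples are the same taxon (here, the text before the first colon) is invented for the example.

```python
from collections import Counter, defaultdict


def merge_taxa(tbt, taxon_of):
    """Sum tag counts of samples that map to the same taxon name."""
    merged = defaultdict(Counter)
    for sample, tag_counts in tbt.items():
        merged[taxon_of(sample)].update(tag_counts)
    return {taxon: dict(counts) for taxon, counts in merged.items()}


def pivot(tbt):
    """Turn {taxon: {tag: count}} into {tag: {taxon: count}}."""
    by_tag = defaultdict(dict)
    for taxon, tag_counts in tbt.items():
        for tag, count in tag_counts.items():
            by_tag[tag][taxon] = count
    return dict(by_tag)


tbt = {
    "B73:lane2": {"CAGCAAAA": 2},
    "B73:lane5": {"CAGCAAAA": 1, "CTGCGGGG": 4},
    "Mo17:lane2": {"CTGCGGGG": 3},
}
merged = merge_taxa(tbt, taxon_of=lambda sample: sample.split(":")[0])
print(pivot(merged))
```

Pivoting puts the tag on the outer key, which suits a SNP caller that walks through tags by position and asks which taxa carry each one.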

SNP Calling
Files used in SNP calling:
  TOPM
  TBT
  Pedigree file (optional)
Some key settings:
  mnf     MinimumF (inbreeding coefficient)
  mnmaf   Minimum Minor Allele Frequency
  mnmac   Minimum Minor Allele Count
  mnlcov  Minimum Locus Coverage
Plugin: TagsToSNPByAlignmentPlugin
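
To show what these four settings measure, the sketch below computes the corresponding per-site statistics from diploid genotype calls. The definitions are common textbook ones and are assumptions here: coverage is the fraction of taxa with a call, the minor allele is the second most common allele, and F = 1 - Hobs/Hexp. How the plugin itself defines and combines the thresholds is not stated on the slide; this sketch simply requires all four to pass.

```python
from collections import Counter


def site_stats(genotypes):
    """Compute coverage, MAF, MAC, and F for one site.

    `genotypes` holds one (allele, allele) tuple per taxon, or None if missing.
    """
    called = [g for g in genotypes if g is not None]
    coverage = len(called) / len(genotypes) if genotypes else 0.0
    alleles = Counter(a for g in called for a in g)
    if len(alleles) < 2:  # monomorphic or uncalled: F is undefined
        return {"coverage": coverage, "maf": 0.0, "mac": 0, "f": None}
    total = sum(alleles.values())
    mac = alleles.most_common()[1][1]  # count of the second most common allele
    h_obs = sum(a != b for a, b in called) / len(called)
    h_exp = 1.0 - sum((n / total) ** 2 for n in alleles.values())
    return {"coverage": coverage, "maf": mac / total, "mac": mac,
            "f": 1.0 - h_obs / h_exp}


def passes_filters(stats, min_f, min_maf, min_mac, min_coverage):
    """Require every threshold to be met (a simplifying assumption)."""
    return (stats["f"] is not None and stats["f"] >= min_f
            and stats["maf"] >= min_maf and stats["mac"] >= min_mac
            and stats["coverage"] >= min_coverage)


inbred_site = [("A", "A"), ("G", "G"), ("A", "A"), ("A", "A"), None]
stats = site_stats(inbred_site)
print(stats, passes_filters(stats, 0.8, 0.01, 2, 0.5))
```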

HapMap
rs#        alleles  chrom  pos     strand  SgSBRIL067:633Y5AAXX:2:C9 SgSBRIL019:633Y5AAXX:2:C3
S1_2100    A/G      1      2100    +       N N N N N N N R N A N
S1_2163    T/C      1      2163    +       N N N N N N T C T T N
S1_13837   T/G      1      13837   +       N N N N N N N G N N T
S1_14606   C/T      1      14606   +       N N C N N N T T T T C
S1_2061    T/A      1      20601   +       T N N N N N N A N N N
S1_68332   C/T      1      68332   +       N N N N N N N N N N N
S1_68596   A/T      1      68596   +       A N N N N N N N N A N
S1_69309   G/A      1      69309   +       N G N N N N N A N N N
S1_79955   T/G      1      79955   +       N T G T T N T T N N N
S1_79961   T/G      1      79961   +       N T T T T N T T N N N
S1_80584   G        1      80584   +       N N N N N N N N N N G
S1_80647   C/T      1      80647   +       N N N N N N N C N N C
S1_81274   T/G      1      81274   +       N N N N N N T G N N N
S1_108834  G/A      1      108834  +       N N N N N N N N N N N
S1_112345  T/G      1      112345  +       N N N N N N K T N N N
S1_115359  C/T      1      115359  +       N N N N N N T C N T
S1_115362  T/C      1      115362  +       N N N N N N N C N N N
S1_115405  G/A      1      115405  +       G G A N N G G G G N
S1_115516  T/G      1      115516  +       N N T N N N T T N N T
S1_116694  A/G      1      116694  +       N A G N N N G A N N N
S1_119016  C/T      1      119016  +       N N N N C N N C N N N
S1_155366  T/C      1      155366  +       N T N N N N
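
The single-letter genotype calls above use IUPAC nucleotide codes, where a plain base stands for a homozygote, an ambiguity code such as R (A/G) or K (G/T) stands for a heterozygote, and N means no call. The sketch below decodes one row in the five-column layout shown on the slide. `parse_hapmap_row` and its `n_meta` parameter are hypothetical; complete HapMap files usually carry more metadata columns before the genotypes, in which case `n_meta` would need to be raised.

```python
# IUPAC codes for single bases and two-base ambiguities.
IUPAC = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def decode_call(code):
    """Return the allele pair for one call, or None for N or unknown codes."""
    return IUPAC.get(code.upper())


def parse_hapmap_row(line, n_meta=5):
    """Split a row into site metadata and decoded genotype calls."""
    fields = line.split()
    name, alleles, chrom, pos, strand = fields[:5]
    calls = [decode_call(code) for code in fields[n_meta:]]
    return {"name": name, "alleles": alleles.split("/"), "chrom": chrom,
            "pos": int(pos), "strand": strand, "calls": calls}


row = parse_hapmap_row("S1_2100 A/G 1 2100 + N N N N N N N R N A N")
print(row["pos"], row["calls"])
```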

Production Pipeline

Why another pipeline?
  The last maize build (30,000 taxa) with the discovery pipeline took over 3 months.
  Most common alleles have been identified after the first few discovery builds.
  Use the information from the discovery pipeline to call SNPs in new runs quickly.
  Improve efficiency and automate.

GBS Bioinformatics Pipelines
[Figure, built up over several slides: the Discovery and Production pipelines side by side. Discovery: Fastq, Tag Counts, TOPM, Tags by Taxa, SNP Caller, Filtered. Production: Fastq, TOPM (TagsOnPhysicalMap), SNP Caller, Filtered.]

Running the Production Pipeline
Required files:
  Sequence file (fastq or qseq)
  Key file
  Production TOPM
  TASSEL 3 Standalone & RawReadsToHapMapPlugin
Running the pipeline:
  One lane processed at a time
  HapMap files by chromosome
  ~7 minutes

Testing Production Pipeline
Compared HapMap files produced by the Discovery Pipeline and the Production Pipeline.
Site comparison:
  Discovery: 48,139
  Production: 47,676
  Difference due to maximum 8 alleles
99.98% correlation of genetic distance matrices
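
A comparison of this kind can be sketched as follows: compute a pairwise genetic distance matrix from each set of genotype calls, then correlate the two sets of pairwise distances. The slide does not say which distance measure was used; the sketch assumes a simple proportion of mismatching calls over sites called in both taxa, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def mismatch_distance(calls_a, calls_b, missing="N"):
    """Proportion of differing calls over sites called in both taxa."""
    pairs = [(a, b) for a, b in zip(calls_a, calls_b)
             if a != missing and b != missing]
    if not pairs:
        return None  # no shared sites, distance undefined
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_vector(genotypes):
    """Flatten the upper triangle of the distance matrix (taxa sorted by name)."""
    taxa = sorted(genotypes)
    return [mismatch_distance(genotypes[a], genotypes[b])
            for a, b in combinations(taxa, 2)]


def pearson(xs, ys):
    """Pearson correlation over positions where both distances are defined."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


discovery = {"t1": "AAAA", "t2": "AAGG", "t3": "GGGG"}
production = {"t1": "AAAN", "t2": "AAGG", "t3": "GGGG"}
r = pearson(distance_vector(discovery), distance_vector(production))
print(round(r, 4))
```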

Next Steps in Pipeline Development
  Hierarchical Data Format (HDF5) supports very large data sets and complex data structures.
  Working to fuse the TOPM, TBT, key file, and pedigree file into one HDF5 repository.
  Continued improvements to the SNP caller.
  Ability to use tags not present in the reference.