GBS Bioinformatics Pipeline(s) Overview

1 GBS Bioinformatics Pipeline(s) Overview
Getting from sequence files to genotypes.
Pipeline Coding: Ed Buckler, Jeff Glaubitz, James Harriman
Presentation: Terry Casstevens, with supporting information from the coders.

2 Philosophy of the GBS Pipeline
Why develop our own pipeline?
- Efficiency
- Only align unique fragments once.
- Data structures specific to these data types.

3 Three Pipelines
Discovery Pipeline
- Requires a reference genome
- Multiple steps to get to genotypes
- Hands-on tutorial is based on this pipeline
Production Pipeline
- Uses information from the Discovery Pipeline
- One step from sequence to genotypes
UNEAK Pipeline
- For species without a reference genome
- Fei Lu will present this tomorrow at 9:30

4 Vocabulary
- Sequence File: text file containing DNA sequence and supplemental information from the Illumina platform.
- Taxa: an individual sample.
- Key File: text file used to assign a GBS Bar Code to a Taxa.
- GBS Tag: DNA sequence consisting of a cut site remnant and additional sequence.
- GBS Bar Code: a short, known DNA sequence used to assign a GBS Tag to its original Taxa.

5 GBS Discovery Pipeline
[Diagram: Discovery pipeline with boxes Sequence, Tag Counts, TOPM, Tags by Taxa, SNP Caller]

7 Raw Sequence (Qseq)
HWI-ST GTCGATTCTGCTGACTTCATGGCTTCTGTTGACG
HWI-ST GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGT
HWI-ST ATGTACTGCACCGTTGCAAGCGAGCACCACCAA
HWI-ST CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAA
HWI-ST GATTTTACTGCACATCGGTCTTGTCACACCAGCT
HWI-ST TCACCCAGCATCACGCCCCTTCACATCCAGTAAA
HWI-ST CTTGACTGCCACCATGAATATGTGTTCCAAGTGC
HWI-ST CCACAACTGCTCCATCTTTTCCATGAGACATTGC
HWI-ST GTATTCTGCACACGAATCAGCTGAGACACCAATT
HWI-ST AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAG
HWI-ST CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTT
HWI-ST TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTT
HWI-ST GGACCTACTGCCCAAGAACGGCTCACCCATCAT
HWI-ST GAGAATCAGCGTGTACGGGGCACGGGGTGACT
HWI-ST TTCTCCAGCCGCATGGGCCGGAGACCAGAGAG
HWI-ST GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCA
HWI-ST TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAA
HWI-ST GGACCTACTGCCGGCGGGACGAAAGCGGTTGT
HWI-ST CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTC
HWI-ST CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTG

9 Key File
Flowcell   Lane  Barcode  DNASample  LibraryPlate  Row  Column  LibraryPrepID  PlateName
81PVTABXX  2     CTCC     Sample_1   1             A    1       1              Plate_A
81PVTABXX  2     TGCA     Sample_2   1             A    2       2              Plate_A
81PVTABXX  2     ACTA     Sample_3   1             A    3       3              Plate_A
81PVTABXX  2     CAGA     Sample_4   1             A    4       4              Plate_A
81PVTABXX  2     AACT     Sample_5   1             A    5       5              Plate_A
81PVTABXX  2     GCGT     Sample_6   1             A    6       6              Plate_A
81PVTABXX  2     TGCGA    Sample_7   1             A    7       7              Plate_A
81PVTABXX  2     CGAT     Sample_8   1             A    8       8              Plate_A
81PVTABXX  2     CGCTT    Sample_9   1             A    9       9              Plate_A
81PVTABXX  2     TCACC    Sample_10  1             A                           Plate_A
81PVTABXX  2     CTAGC    Sample_11  1             A                           Plate_A
81PVTABXX  2     ACAAA    Sample_12  1             A                           Plate_A
81PVTABXX  2     TTCTC    Sample_13  1             B    1       13             Plate_A
81PVTABXX  2     AGCCC    Sample_14  1             B    2       14             Plate_A
81PVTABXX  2     GTATT    Sample_15  1             B    3       15             Plate_A
81PVTABXX  2     CTGTA    Sample_16  1             B    4       16             Plate_A
81PVTABXX  2     ACCGT    Sample_17  1             B    5       17             Plate_A
81PVTABXX  2     GTAA     Sample_18  1             B    6       18             Plate_A
81PVTABXX  2     GGTTGT   Sample_19  1             B    7       19             Plate_A
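The key file's job (mapping a barcode on a given flowcell and lane to a sample) can be sketched as a simple lookup table. This is a minimal illustration, not the pipeline's own reader: it assumes the file is tab-delimited with the header row shown on the slide, and the function name `load_key_file` is hypothetical.

```python
import csv
import io

def load_key_file(text):
    """Map (flowcell, lane, barcode) -> DNASample from key-file text.

    Assumption: the text is tab-delimited and starts with a header row
    containing at least Flowcell, Lane, Barcode and DNASample.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        (row["Flowcell"], row["Lane"], row["Barcode"]): row["DNASample"]
        for row in reader
    }

example = (
    "Flowcell\tLane\tBarcode\tDNASample\n"
    "81PVTABXX\t2\tCTCC\tSample_1\n"
    "81PVTABXX\t2\tTGCA\tSample_2\n"
)
key = load_key_file(example)
print(key[("81PVTABXX", "2", "CTCC")])
```

Keying on flowcell and lane as well as barcode reflects that the same barcode can be reused on other lanes.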

10 GBS Tags
Fragment from GBS library: Barcode adapter | Cut site | Insert | Cut site | Common adapter
Good reads (only the first 64 bases after the barcode are kept):
- typical read: Barcode | Cut site | Insert (first 64 bases)
- short fragment: Barcode | Cut site | Insert (<64bp) | Cut site | Common adapter
- chimera or partial digestion: Barcode | Cut site | Insert (<64bp) | Cut site | 2nd Insert

12 GBS Tags
Fragment from GBS library: Barcode adapter | Cut site | Insert | Cut site | Common adapter
Good reads (only the first 64 bases after the barcode are kept):
- typical read: Barcode | Cut site | Insert (first 64 bases)
- short fragment: Barcode | Cut site | Insert (<64bp) | Cut site
- chimera or partial digestion: Barcode | Cut site | Insert (<64bp) | Cut site
Rejected reads:
- adapter dimer: Barcode | Cut site | Common adapter
- Not matching barcode and cut site remnant
- Contains N in first 64 bases after the barcode

14 Tag Counts
- With information from the key file, each sequence file is processed; tags are identified and counted.
- If a tag is shorter than 64 bases, it is padded.
- The tags and counts are put into a tag count file for each sequence file.
Plugins: QseqToTagCountsPlugin / FastqToTagCountsPlugin
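The tag-counting step can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugins' actual code: the cut-site remnants are passed in by the caller, the pad base `"A"` is an assumption (the slides only say short tags are padded), and trimming a short fragment at its second cut site is left out. The names `extract_tag` and `count_tags` are hypothetical.

```python
from collections import Counter

TAG_LENGTH = 64  # bases kept after the barcode, per the slides

def extract_tag(read, barcodes, cut_site_remnants, pad_base="A"):
    """Return (barcode, tag) for a good read, or None for a rejected read."""
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    for barcode in sorted(barcodes, key=len, reverse=True):
        if not read.startswith(barcode):
            continue
        rest = read[len(barcode):]
        # The barcode must be followed by a cut-site remnant.
        if not any(rest.startswith(r) for r in cut_site_remnants):
            continue
        tag = rest[:TAG_LENGTH]
        if "N" in tag:
            return None  # reject: N in the first 64 bases after the barcode
        # Pad short tags to the fixed length (pad base is an assumption).
        return barcode, tag.ljust(TAG_LENGTH, pad_base)
    return None  # reject: no matching barcode plus cut-site remnant

def count_tags(reads, barcodes, cut_site_remnants):
    """Count how often each distinct tag occurs in one sequence file."""
    counts = Counter()
    for read in reads:
        hit = extract_tag(read, barcodes, cut_site_remnants)
        if hit is not None:
            counts[hit[1]] += 1
    return counts
```

Because identical tags collapse into one counted entry, each unique sequence needs to be aligned only once later on.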

15 Master Tag Counts
- The individual tag count files are merged into a master tag count file.
- A minimum count is specified at the merge stage to exclude tags with low counts (likely sequencing errors).
Plugin: MergeMultipleTagCountsPlugin
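A minimal sketch of the merge, assuming each tag count file has already been read into a tag-to-count mapping and that the minimum count is applied to the summed total (the slides do not say whether it applies before or after summing). `merge_tag_counts` is a hypothetical name.

```python
from collections import Counter

def merge_tag_counts(per_file_counts, min_count=1):
    """Sum per-file tag counts and drop tags whose total is below min_count."""
    master = Counter()
    for counts in per_file_counts:
        master.update(counts)  # adds counts tag by tag
    return {tag: n for tag, n in master.items() if n >= min_count}

lane1 = {"TAG_A": 2, "TAG_B": 1}
lane2 = {"TAG_A": 1, "TAG_C": 1}
print(merge_tag_counts([lane1, lane2], min_count=2))
```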

16 Conversion of Tags to Fastq
- Sequence aligners do not work with the tag count file format.
- In preparation for the alignment step, the tag count file is converted to fastq format.
Plugin: TagCountsToFastqPlugin
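As a rough sketch of the conversion, each tag becomes one four-line fastq record. Tags carry no per-base qualities, so a constant placeholder quality character is used here; that choice, the record naming scheme, and the function name `tag_counts_to_fastq` are all assumptions made for illustration.

```python
def tag_counts_to_fastq(tag_counts, quality_char="f"):
    """Render a tag -> count mapping as fastq text, one record per tag."""
    records = []
    for i, (tag, count) in enumerate(sorted(tag_counts.items())):
        header = f"@tag{i}_count={count}"    # hypothetical naming scheme
        quality = quality_char * len(tag)    # placeholder qualities
        records.append(f"{header}\n{tag}\n+\n{quality}\n")
    return "".join(records)

print(tag_counts_to_fastq({"ACGT": 3}), end="")
```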

18 Tag Alignment / TOPM
- The GBS pipeline uses an external aligner to do the initial alignment.
- The current version uses bowtie2, which produces the alignment in the SAM format.
- We convert the SAM file into our tags on physical map (TOPM) format.
Tools: bowtie2, SAMConverterPlugin
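To make the SAM-to-TOPM idea concrete, here is a sketch that pulls the tag name, chromosome, position, and strand out of one SAM alignment line. It relies only on the standard SAM columns (QNAME, FLAG, RNAME, POS) and flag bits (0x4 unmapped, 0x10 reverse strand). The `TagPosition` record is hypothetical and much simpler than whatever the TOPM actually stores.

```python
from typing import NamedTuple, Optional

class TagPosition(NamedTuple):
    tag_name: str
    chromosome: Optional[str]
    position: Optional[int]   # 1-based leftmost position, as in SAM
    strand: Optional[str]

def parse_sam_line(line):
    """Return a TagPosition for an alignment line, or None for a header line."""
    if line.startswith("@"):
        return None  # SAM header lines start with '@'
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
    if flag & 0x4:
        # Unmapped tag: keep the name but record no position.
        return TagPosition(qname, None, None, None)
    strand = "-" if flag & 0x10 else "+"
    return TagPosition(qname, rname, pos, strand)

line = "tag0\t16\tchr1\t100\t42\t64M\t*\t0\t0\tACGT\tffff"
print(parse_sam_line(line))
```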

19 TOPM

20 So Far
We have:
- Identified and counted GBS tags.
- Converted the tag counts file to fastq.
- Aligned the tags to a reference.
- Converted the alignment to TOPM.

22 Tags by Taxa
In this step we identify which tags are present in which taxa.
Files used:
- Original Sequence Files
- Key File
- Master Tag Count File
Recently migrated to the HDF5 file format:
- Efficient storage
- Large data sets
Plugin: SeqToTBTHDF5Plugin
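Conceptually, the tags-by-taxa table records how often each master tag was seen in each taxon. The sketch below uses plain dictionaries rather than HDF5 and assumes the reads have already been split into (barcode, tag) pairs. `build_tbt` and its arguments are hypothetical names.

```python
from collections import defaultdict

def build_tbt(barcode_tag_pairs, barcode_to_taxon, master_tags):
    """Return tag -> {taxon: count}, restricted to tags in the master list."""
    tbt = {tag: defaultdict(int) for tag in master_tags}
    for barcode, tag in barcode_tag_pairs:
        taxon = barcode_to_taxon.get(barcode)
        if taxon is None or tag not in tbt:
            continue  # unknown barcode, or tag filtered out of the master list
        tbt[tag][taxon] += 1
    return {tag: dict(counts) for tag, counts in tbt.items()}

pairs = [("CTCC", "TAG_A"), ("CTCC", "TAG_A"), ("TGCA", "TAG_A"), ("TGCA", "TAG_X")]
key = {"CTCC": "Sample_1", "TGCA": "Sample_2"}
print(build_tbt(pairs, key, ["TAG_A", "TAG_B"]))
```

Tags missing from the master list (here `TAG_X`) are ignored, which is how the minimum-count filter from the merge step carries through.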

23 Tags By Taxa: Additional Operations
- If many TBTs have been created, they are merged into one TBT.
- Taxa that were sequenced multiple times are merged.
- The TBT table is pivoted in preparation for SNP calling.
Plugin: ModifyTBTHDF5Plugin
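The three operations can be sketched on a dictionary-based table (tag -> {taxon: count}). This assumes "pivoted" means transposing the table from tag-major to taxon-major, which the slides do not spell out. All function names are hypothetical.

```python
from collections import Counter

def merge_tbts(tbts):
    """Sum several tag -> {taxon: count} tables into one."""
    merged = {}
    for tbt in tbts:
        for tag, taxon_counts in tbt.items():
            merged.setdefault(tag, Counter()).update(taxon_counts)
    return {tag: dict(c) for tag, c in merged.items()}

def merge_replicate_taxa(tbt, run_to_taxon):
    """Collapse multiple sequencing runs of the same taxon into one entry."""
    out = {}
    for tag, run_counts in tbt.items():
        combined = Counter()
        for run, n in run_counts.items():
            # Runs without an entry in the mapping keep their own name.
            combined[run_to_taxon.get(run, run)] += n
        out[tag] = dict(combined)
    return out

def pivot(tbt):
    """Transpose tag -> {taxon: count} into taxon -> {tag: count}."""
    pivoted = {}
    for tag, taxon_counts in tbt.items():
        for taxon, n in taxon_counts.items():
            pivoted.setdefault(taxon, {})[tag] = n
    return pivoted
```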

25 SNP Calling
Files used in SNP calling:
- TOPM
- TBT
Some key settings:
- mnf: MinimumF (inbreeding coefficient)
- mnmaf: Minimum Minor Allele Frequency
- mnmac: Minimum Minor Allele Count
- mnlcov: Minimum Locus Coverage
Plugin: TagsToSNPByAlignmentPlugin
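To show what these settings measure, the sketch below computes the four per-site statistics from diploid genotype calls. It uses textbook definitions (F = 1 - observed heterozygosity / expected heterozygosity, with expected = 2pq for a biallelic site) and counts the minor allele in allele copies. Whether the plugin computes them exactly this way, and how it combines the thresholds, is not stated in the slides. `site_stats` is a hypothetical name.

```python
from collections import Counter

def site_stats(genotypes):
    """Per-site statistics from a list of (allele, allele) tuples; None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    allele_counts = Counter(a for g in called for a in g)
    total_alleles = 2 * len(called)
    ranked = allele_counts.most_common()
    minor_count = ranked[1][1] if len(ranked) > 1 else 0
    maf = minor_count / total_alleles
    h_obs = sum(1 for a, b in called if a != b) / len(called)
    h_exp = 2 * maf * (1 - maf)          # expected heterozygosity, 2pq
    f = 1 - h_obs / h_exp if h_exp > 0 else None
    return {
        "locus_coverage": len(called) / len(genotypes),  # compare with mnlcov
        "minor_allele_count": minor_count,               # compare with mnmac
        "minor_allele_freq": maf,                        # compare with mnmaf
        "inbreeding_f": f,                               # compare with mnf
    }

calls = [("A", "A"), ("A", "A"), ("G", "G"), ("A", "G"), None]
print(site_stats(calls))
```

In inbred material, true SNPs should show few heterozygotes and therefore a high F, so a minimum-F threshold helps screen out sites driven by paralogs or sequencing error.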

26 HapMap
rs# alleles chrom pos strand SgSBRIL067:633Y5AAXX:2:C9 SgSBRIL019:633Y5AAXX:2:C3
S1_2100 A/G N N N N N N N R N A N
S1_2163 T/C N N N N N N T C T T N
S1_13837 T/G N N N N N N N G N N T
S1_14606 C/T N N C N N N T T T T C
S1_2061 T/A T N N N N N N A N N N
S1_68332 C/T N N N N N N N N N N N
S1_68596 A/T A N N N N N N N N A N
S1_69309 G/A N G N N N N N A N N N
S1_79955 T/G N T G T T N T T N N N
S1_79961 T/G N T T T T N T T N N N
S1_80584 G N N N N N N N N N N G
S1_80647 C/T N N N N N N N C N N C
S1_81274 T/G N N N N N N T G N N N
S1_ G/A N N N N N N N N N N N
S1_ T/G N N N N N N K T N N N
S1_ C/T N N N N N N T C N T
S1_ T/C N N N N N N N C N N N
S1_ G/A G G A N N G G G G N
S1_ T/G N N T N N N T T N N T
S1_ A/G N A G N N N G A N N N
S1_ C/T N N N N C N N C N N N
S1_ T/C N T N N N N
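The single-letter calls in the HapMap table follow the IUPAC nucleotide codes: A, C, G, T are homozygous calls, N is missing, and ambiguity codes such as R (A/G) or K (G/T) denote heterozygotes. A small decoding sketch, with the hypothetical helper name `decode_call`:

```python
# IUPAC codes for single bases and two-base ambiguities.
IUPAC_DIPLOID = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}

def decode_call(code):
    """Return the two alleles for a one-letter call, or None if missing (N)."""
    if code == "N":
        return None
    return IUPAC_DIPLOID[code]

# Calls taken from the S1_2100 row of the table (alleles A/G).
row_calls = "N N N N N N N R N A N".split()
print([decode_call(c) for c in row_calls])
```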

27-28 GBS Discovery Pipeline
[Diagram: Discovery pipeline with boxes Fastq, Tag Counts, TOPM, Tags by Taxa, SNP Caller; slide 28 adds a Filtered box]

29 Production Pipeline

30 Why another pipeline?
- The last maize build (30,000 taxa) with the discovery pipeline took over 3 months.
- Most common alleles have been identified after the first few discovery builds.
- Use the information from the discovery pipeline to call SNPs in new runs quickly.
- Improve efficiency and automate.

31-36 GBS Bioinformatics Pipelines
[Diagram, built up over six slides: the Discovery pipeline (Fastq, Tag Counts, TOPM, Tags by Taxa, SNP Caller, Filtered) shown alongside the Production pipeline (Fastq, TOPM). TOPM = TagsOnPhysicalMap.]

37 Running the Production Pipeline
Required files:
- Sequence file (fastq or qseq)
- Key file
- Production TOPM
- TASSEL 3 Standalone & RawReadsToHapMapPlugin
Running the pipeline:
- One lane processed at a time
- HapMap files by chromosome
- ~20 minutes
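One way to picture the "one step" production idea: if the production TOPM already ties each known tag to the SNP alleles it carries, reads can be tallied straight into per-taxon allele counts with no new alignment. The sketch below is only a guess at that idea. The `tag_to_variants` mapping is a hypothetical stand-in for the production TOPM, and `tally_alleles` is not a TASSEL function.

```python
from collections import Counter

def tally_alleles(barcode_tag_pairs, barcode_to_taxon, tag_to_variants):
    """Count (taxon, site, allele) observations using a precomputed tag lookup.

    tag_to_variants: tag -> list of (site_id, allele), assumed to come from
    an earlier discovery build.
    """
    tallies = Counter()
    for barcode, tag in barcode_tag_pairs:
        taxon = barcode_to_taxon.get(barcode)
        if taxon is None:
            continue  # barcode not in the key file
        # Tags unknown to the lookup contribute nothing.
        for site, allele in tag_to_variants.get(tag, ()):
            tallies[(taxon, site, allele)] += 1
    return tallies

lookup = {"TAG_A": [("S1_2100", "A")], "TAG_B": [("S1_2100", "G")]}
pairs = [("CTCC", "TAG_A"), ("CTCC", "TAG_B"), ("CTCC", "TAG_A"), ("CTCC", "TAG_Z")]
print(tally_alleles(pairs, {"CTCC": "Sample_1"}, lookup))
```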

38 Testing Production Pipeline
Compared HapMap files produced by the Discovery Pipeline and the Production Pipeline.
Site comparison:
- Discovery: 48,139
- Production: 47,676
- Difference due to maximum of 8 alleles
99.98% correlation of genetic distance matrices
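A comparison of two genetic distance matrices could be carried out along these lines. The slides do not say which correlation was used. This sketch assumes a Pearson correlation over the off-diagonal upper-triangle entries of two symmetric matrices with taxa in the same order. The function names are hypothetical.

```python
import math

def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def distance_matrix_correlation(a, b):
    """Correlate two symmetric distance matrices over their pairwise distances."""
    return pearson(upper_triangle(a), upper_triangle(b))

discovery = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
production = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
print(distance_matrix_correlation(discovery, production))
```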

39 Next Steps In Pipeline Development
- Hierarchical Data Format (HDF5) supports very large data sets and complex data structures.
- Working to fuse TOPM, TBT, Keyfile, and Pedigree File into one HDF5 repository.
- Continued improvements to SNP caller.
- Ability to use tags not present in the reference.
