GBS Bioinformatics Pipeline


GBS Bioinformatics Pipeline... or, Where Your Data Go After Sequencing
James Harriman, Ed Buckler, Jeff Glaubitz

Reference Genome Pipeline
[Flowchart. Legend: Process; File (data structure).
Processes: QseqToTagCount, QseqToTBT, Merge TagCounts, Merge TagsByTaxa, TagCountsToFASTQ, BWA (Burrows-Wheeler Aligner), SAM converter, TagsToSNPByAlignment.
Files: Qseq, Key files, TagCounts per lane, TagCounts for species (Master Tags), TagsByTaxa files (1 per lane), TagsByTaxa for species, SAM alignment, TagsOnPhysicalMap, HapMap.]

Non-Reference Genome Pipeline
[Flowchart. Legend: Process; File (data structure).
Processes: QseqToTagCount, QseqToTBT, Merge TagCounts, Merge TagsByTaxa, TagHomologyPhaseNoAnchor.
Files: Qseq, Key files, TagCounts per lane, TagCounts for species (Master Tags), TagsByTaxa files (1 per lane), TagsByTaxa for species, HapMap.]

Raw Sequence (Qseq)
HWI-ST397 0 3 68 15896 200039 0 1 GTCGATTCTGCTGACTTCATGGCTTCTGTTGACGACGATGTGGAACGAGCTGTTGTTGAAACTGATGAGGTTGC
HWI-ST397 0 3 68 15960 200043 0 1 GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGTATGCGATGACAGTTACTCTTACTGTCCATTGTCAGCATTGC
HWI-ST397 0 3 68 15831 200053 0 1 ATGTACTGCACCGTTGCAAGCGAGCACCACCAAGCGGCGGTATGCACTTTGCAATATGTAGCTAGAATAGGATT
HWI-ST397 0 3 68 15867 200049 0 1 CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAATGCCTCTCTTGGCCTAGCATTTTGGGCATACCCTGTGACCA
HWI-ST397 0 3 68 15943 200048 0 1 GATTTTACTGCACATCGGTCTTGTCACACCAGCTATACCTGTAGAGTTGCCTTCCACAGTTGTAGAGATCGGAAG
HWI-ST397 0 3 68 15812 200062 0 1 TCACCCAGCATCACGCCCCTTCACATCCAGTAAAACCCCTGAATGATGTGCTGTCACTGTTTGATATACAGTTGT
HWI-ST397 0 3 68 15888 200067 0 1 CTTGACTGCCACCATGAATATGTGTTCCAAGTGCCACAAGGACTTGGCCCTGAAGCAAGAACAAGCCAAACTTG
HWI-ST397 0 3 68 15969 200067 0 1 CCACAACTGCTCCATCTTTTCCATGAGACATTGCTCCCGCCATTGCACCCTTGGCATCAGCAGAGATCGGAAGA
HWI-ST397 0 3 68 15786 200078 0 1 GTATTCTGCACACGAATCAGCTGAGACACCAATTGGGCATGAATCAAATGGCGCCATTGCCGGGGATCGAACCC
HWI-ST397 0 3 68 15830 200072 0 1 AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAGGGCTCATATTCAGTCACCTATATCAATTTCGAAATGGATTTC
HWI-ST397 0 3 68 15863 200073 0 1 CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTTGGAGCGTCTATCGGCGTTGCTGAGATCGGAAGAGCGGTT
HWI-ST397 0 3 68 15762 200088 0 1 TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTTAGTGGTTCGCAGAGCATTTGGCAGCTGAGATGGGAAGAGC
HWI-ST397 0 3 68 15903 200085 0 1 GGACCTACTGCCCAAGAACGGCTCACCCATCATCCGCTTTCTTCACCTTCCGTCTTCTTTGGCTGAGATCGGAA
HWI-ST397 0 3 68 15921 200082 0 1 GAGAATCAGCGTGTACGGGGCACGGGGTGACTGCTGTTGCGTGCGAGGGCTGAGATCGGAAGAGCGGTTCA
HWI-ST397 0 3 68 15984 200085 0 1 TTCTCCAGCCGCATGGGCCGGAGACCAGAGAGGCCTCCCCAGGATTTGCACGATAGACCACGACTTATGGACG
HWI-ST397 0 3 68 15788 200096 0 1 GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCAATTGCCTCAGCAACTTGGGCCACAAACACCACAGCTGAGA
HWI-ST397 0 3 68 15842 200099 0 1 TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAAAAGAGGGCCCCTCACTTCTCTCAAGTGCTGAGATCGGAAG
HWI-ST397 0 3 68 15876 200105 0 1 GGACCTACTGCCGGCGGGACGAAAGCGGTTGTTGAATGATGGGGGTCACTAGGCCTTCCAGGGCCTTTAAGC
HWI-ST397 0 3 68 15937 200097 0 1 CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTCTCGGCCTTCTTCAAGCCATTCTCTTGGCAGACGGCTTTGC
HWI-ST397 0 3 68 15958 200102 0 1 CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTGGTGCCCCTACCTCGGACAAGACAGATGCAGAGATCGGAA
HWI-ST397 0 3 68 15765 200113 0 1 CCAGCTCAGCATGGATCTCTCCTTGATGGACTGAAAGCGCGTGTGCTCCCCTGTGTGATGGAAAGTGGCAGTG
HWI-ST397 0 3 68 15912 200114 0 1 CCAGCTCAGCTCAAGCATTGGCTTCCGCTTTGGCATCCTGGAGGGTAAGCTTCTGCTCTTCTCACTAGAGGAG
HWI-ST397 0 3 68 15791 200127 0 1 ACAAACAGCAGAGGTCGCATTGTAGTTAGTCCGGGACTTGCCCAGTTCATTGCTGAGATCGGAAGAGCGGTTC
HWI-ST397 0 3 68 15831 200117 0 1 GCTCTACAGCTTCTGGCCAGAATGCTTTTGGCACTTGTTTGTCACAAAGCATGCACTGAACCATATTCATGATAG
HWI-ST397 0 3 68 15848 200124 0 1 TTCTCCAGCTGCTACATGCACCGTGGGAAGAAGGTCTGCCCCACATACCCACCAGCCATCGCCCTTCTCACAT
HWI-ST397 0 3 68 15891 200120 0 1 GAGATACAGCTGCGAATTGGGGGTTCCTGTGTTGCGAAGTGGCACTCGTGTGCCAAACTTGGCTACGCAGAGA
HWI-ST397 0 3 68 15931 200128 0 1 AAAAGTTCAGCAATACCTGTTGAAGCCAAGCCCTTGTGGTGATTGCCTCGTTCATTGCTGCTGAGATCGGAAGA
HWI-ST397 0 3 68 15991 200121 0 1 GAATCTGCTACTAGTGAGCCTTTGTATGGGGACCGAGTTCAGAAGCTCTAACCCTCGTTTTCCCATCTGCTGAG
HWI-ST397 0 3 68 15765 200133 0 1 TAGCATGCCTGCTGCAGGAGTTGGTGCCCAGCATTCTCAGGTGTAGTCCAAATTCTGTCTGATACTTATTGTTTA
HWI-ST397 0 3 68 15810 200133 0 1 TTCAGACAGATGATGCTTGTCAAGGGTCACCATCTTGCATTGCGCTGCGTCACATCCTTAGTGGGAATAGGGGA
HWI-ST397 0 3 68 15871 200135 0 1 CTTGCTTCAGCCATGTAGAGTGGTGTTGCTCCTTTACTACCACGAATCATTGGTAACTCCCTGTTCTTATTCACC
HWI-ST397 0 3 68 15974 200136 0 1 TTCAGACAGCCAAACGACGTCTTAGTGGAGAAAATACCTGAGAAAAGTCAAGAAACCAAAACACTAAAAAATGA
HWI-ST397 0 3 68 15909 200147 0 1 AGCCTCAGCTTGGTTGCTTGTGGTTGGGGGTGAGGGGGCGGGCGGGAACTTATGTTTGCGCCCCGAGGCGG
HWI-ST397 0 3 68 15946 200152 0 1 CTTGACTGGGCGTGGTGCTGAGGCTACTGCGGAATTGAGGTGTTGTCATCCACCGGATTGGGTCGTAGGGCG
HWI-ST397 0 3 68 15774 200153 0 1 TTCAGACAGCCAACTGAGATGACTCTCATTCTTGGTAGGAACCAATTTCTGAGAGCTTCGTAATGACATCAACTA
HWI-ST397 0 3 68 15814 200155 0 1 GAGATACAGCAACAAATGATGTCATTCCTTGCAAAAGCTGTACAAAGCCCTGGTTTCTTAGCTCAGCTGGTACAG
HWI-ST397 0 3 68 15850 200154 0 1 GTGTTTGGTCGTGAAAGTGGACCTCTTTCAGGTGCAGGTGCGAGTAGAAGGAGGTCCCAGAGACGTGCGGCT
HWI-ST397 0 3 68 15870 200157 0 1 GAGAAACCGCAGAATGATAGCAAAAAGCGCGTTACAGGAGATATTAAGAAAAGGAGACTTGCAATGCAGGAGTA
HWI-ST397 0 3 68 15984 200158 0 1 CGTCAACTGCATGAAGGAGGTTGTCTGGCCGTTGGAGGAGTGATTTTGGAAGGCTGAGATCGGAAGAAAGGT

Assignment to Samples
Barcode sequences from the plate map are compared to barcode sequences in the reads, in order to associate reads with the samples from which they originate.
Parameters: Users supply a plate map and staff members supply DNA barcodes. These are combined into a table of barcodes by sample.

Plate Map
Column groups: Project Details, Sample Details, Organism Detail
Columns: Project Name, Source Lab, Plate Name, Well, Sample Name, Pedigree, Population, Stock Number, Sample
BREAD Buckler BREAD-Maize-A A01 PI597982 inbred 04A0160A 10 100 1000 Wenyan Zhu
BREAD Buckler BREAD-Maize-A B01 blank 0 0 0 plantae
BREAD Buckler BREAD-Maize-A C01 PI576130 inbred 04A0191B 10 100 1000 Wenyan Zhu
BREAD Buckler BREAD-Maize-A D01 PI655991 inbred 04A0165A 10 100 1000 Wenyan Zhu
BREAD Buckler BREAD-Maize-A E01 PI656059 inbred 04A0193B 10 100 1000 Wenyan Zhu
BREAD Buckler BREAD-Maize-A F01 CML91 inbred 04A0005BA 10 100 1000 Wenyan Zhu plantae
BREAD Buckler BREAD-Maize-A G01 CML311 inbred 04A0301A 10 100 1000 Wenyan Zhu plantae
BREAD Buckler BREAD-Maize-A H01 CML311 inbred 04A0200A 10 100 1000 Wenyan Zhu plantae
BREAD Buckler BREAD-Maize-A A02 MR_0011.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0281A 10
BREAD Buckler BREAD-Maize-A B02 MR_0013.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0279B 10
BREAD Buckler BREAD-Maize-A C02 MR_0014.3 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0164B 10
BREAD Buckler BREAD-Maize-A D02 MR_0015.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0163A 10
BREAD Buckler BREAD-Maize-A E02 MR_0016.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0315B 10
BREAD Buckler BREAD-Maize-A F02 MR_0018.2 (PI655994 x PI655998)S4 PI655994 x PI655998 02F146114A 10
BREAD Buckler BREAD-Maize-A G02 MR_0020.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0289B 10
BREAD Buckler BREAD-Maize-A H02 MR_0022.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0171A 10
BREAD Buckler BREAD-Maize-A A03 MR_0025.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0170B 10
BREAD Buckler BREAD-Maize-A B03 MR_0027.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0381B 10
BREAD Buckler BREAD-Maize-A C03 MR_0028.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0258A 10
BREAD Buckler BREAD-Maize-A D03 MR_0029.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0304B 10
Cacao Buckler BREAD-Maize-A E03 Tc1536 Catie F1 04A0216A 10 100 1000 Jemmy Takrama
Cacao Buckler BREAD-Maize-A F03 Tc7959 Brazil F2 04A0255A 10 100 1000 Jemmy Takrama
BREAD Buckler BREAD-Maize-A G03 PI542406 inbred 04A0217A 10 100 1000 Wenyan Zhu
BREAD Buckler BREAD-Maize-A H03 PI655981 inbred 04A0167A 10 100 1000 Wenyan Zhu
BREAD Buckler BREAD-Maize-A A04 PI656007 inbred 04P160451A 10 100 1000 Wenyan Zhu
BREAD Buckler BREAD-Maize-A B04 PI17548 inbred 04A0258B 10 100 1000 Wenyan Zhu plantae
BREAD Buckler BREAD-Maize-A C04 PI564163 inbred 04A0244A 10 100 1000 Wenyan Zhu
BREAD Buckler BREAD-Maize-A D04 PI651492 inbred 04A0298A 10 100 1000 Wenyan Zhu
BREAD Buckler BREAD-Maize-A E04 PI656008 inbred 04A0293B 10 100 1000 Wenyan Zhu
BREAD Buckler BREAD-Maize-A F04 PI655985 inbred 04A0296A 10 100 1000 Wenyan Zhu
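The barcode comparison described under "Assignment to Samples" can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function name `assign_sample` is hypothetical, the barcode-to-sample pairs are a small subset in the style of the example barcode key, and the requirement that the barcode be followed by a cut-site remnant (here `CAGC` or `CTGC`) is an assumption about the enzyme in use.

```python
# Illustrative barcode-to-sample table (a small, hypothetical subset).
BARCODE_TO_SAMPLE = {
    "CTCC": "M0001",
    "GAGAAT": "M0058",
    "CCAGCT": "M0068",
    "TTCAGA": "M0081",
}

# Assumption: a usable read continues with the cut-site remnant right after
# the barcode. These two motifs are placeholders for the enzyme in use.
CUT_SITE_REMNANTS = ("CAGC", "CTGC")


def assign_sample(read, barcode_to_sample=BARCODE_TO_SAMPLE):
    """Return (sample, read_without_barcode), or (None, None) if unassigned."""
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    for barcode in sorted(barcode_to_sample, key=len, reverse=True):
        if not read.startswith(barcode):
            continue
        remainder = read[len(barcode):]
        if remainder.startswith(CUT_SITE_REMNANTS):
            return barcode_to_sample[barcode], remainder
    return None, None


reads = [
    "GAGAATCAGCTTTTCCAACACC",  # barcode GAGAAT, then CAGC
    "CTCCCTGCGGGTGCGCGCGACC",  # barcode CTCC, then CTGC
    "GTCGATTCTGCTGACTTCATGG",  # no barcode from the table
]
assignments = [assign_sample(read)[0] for read in reads]
print(assignments)
```

Reads whose prefix matches no barcode, or whose barcode is not followed by an expected remnant, are left unassigned rather than guessed.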

Example DNA Barcode Key
Flowcell   Lane  barcode  sample   Plate#  Row  Column  PlateName
434GFAAXX  2     CTCC     M0001    1       A    1       IBM1 1A01
434GFAAXX  2     TGCA     M0012    1       A    2       IBM1 1A02
434GFAAXX  2     ACTA     M0021    1       A    3       IBM1 1A03
434GFAAXX  2     GTCT     M0029    1       A    4       IBM1 1A04
434GFAAXX  2     GAAT     M0038    1       A    5       IBM1 1A05
434GFAAXX  2     GCGT     M0046    1       A    6       IBM1 1A06
434GFAAXX  2     TGGC     M0057    1       A    7       IBM1 1A07
434GFAAXX  2     CGAT     M0067    1       A    8       IBM1 1A08
434GFAAXX  2     CTTGA    M0080    1       A    9       IBM1 1A09
434GFAAXX  2     TCACC    M0090    1       A    10      IBM1 1A10
434GFAAXX  2     CTAGC    M0099    1       A    11      IBM1 1A11
434GFAAXX  2     ACAAA    M0113    1       A    12      IBM1 1A12
434GFAAXX  2     TTCTC    M0003    1       B    1       IBM1 1B01
434GFAAXX  2     AGCCC    M0013    1       B    2       IBM1 1B02
434GFAAXX  2     GTATT    M0022    1       B    3       IBM1 1B03
434GFAAXX  2     CTGTA    M0030    1       B    4       IBM1 1B04
434GFAAXX  2     AGCAT    M0039    1       B    5       IBM1 1B05
434GFAAXX  2     ACTAT    M0047    1       B    6       IBM1 1B06
434GFAAXX  2     GAGAAT   M0058    1       B    7       IBM1 1B07
434GFAAXX  2     CCAGCT   M0068    1       B    8       IBM1 1B08
434GFAAXX  2     TTCAGA   M0081    1       B    9       IBM1 1B09
434GFAAXX  2     TAGGAA   unknown  1       B    10      IBM1 1B10

Notes on Names & Chromosomes
Chromosome (or contig) names MUST be integers.
Sample names, some advice: NO spaces. NO colons (":"). Try to avoid weird characters.
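The naming advice above is easy to check before submitting a key file. The sketch below is only an illustration: the function names are hypothetical, and "weird characters" is interpreted here, as an assumption, as anything other than letters, digits, underscore, period and hyphen.

```python
import re

# Assumption: "weird characters" means anything outside this conservative set.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_.\-]+$")


def check_sample_name(name):
    """Return a list of problems with a sample name (empty list means OK)."""
    problems = []
    if " " in name:
        problems.append("contains a space")
    if ":" in name:
        problems.append("contains a colon")
    if not problems and not SAFE_NAME.match(name):
        problems.append("contains unusual characters")
    return problems


def check_chromosome(name):
    """Chromosome or contig names must be plain integers."""
    return name.isascii() and name.isdigit()


for candidate in ["MR_0011.1", "PI655994 x PI655998", "plate1:A01"]:
    print(candidate, check_sample_name(candidate))
print(check_chromosome("7"), check_chromosome("chr7"))
```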

QseqToTagCount
Processes a Qseq file so we know which alleles (tags) are present in the sample:
Handles sequence quality issues
Identifies the barcodes
Removes problem tags
Counts tags

GBS Restriction Fragment Structure
[Figure: restriction fragment structures built from barcode adapter, cut site, read and common adapter segments. Classes shown: accepted read; rejected or trimmed reads (potential chimeric sequence, short sequence, adapter dimer).]

Sequence Processing
Raw sequence data is processed into unique 64-bp sequences. For example:
CTCCCAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC
GTTGAACAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC
Becomes:
CAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC 64 2

Parameters:
Restriction enzyme: Different enzymes will create different sequence motifs, such as overlapping cut sites, palindromes or wobble bases.
Barcode: Barcode sequences must be provided to identify acceptable reads.
Number of identical sequences accepted: This gives investigators the option to ignore repetitive sequences or singleton reads.
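The collapse from barcoded reads to counted tags can be sketched as below, using the two example reads from the slide. This is a simplified illustration rather than the QseqToTagCount implementation: the function names are hypothetical, the leading `CTCC` and `GTTGAA` are assumed to be the barcodes, inserts shorter than 64 bp are simply skipped, and `min_count` stands in for the "number of identical sequences accepted" parameter.

```python
from collections import Counter

TAG_LENGTH = 64  # tags are truncated to 64 bp


def strip_barcode(read, barcodes):
    """Remove a leading barcode; return None if no barcode matches."""
    for barcode in sorted(barcodes, key=len, reverse=True):
        if read.startswith(barcode):
            return read[len(barcode):]
    return None


def count_tags(reads, barcodes, min_count=1):
    """Collapse reads into {tag: count}, keeping tags seen at least min_count times."""
    counts = Counter()
    for read in reads:
        insert = strip_barcode(read, barcodes)
        # Simplification: inserts shorter than a full tag are skipped.
        if insert is None or len(insert) < TAG_LENGTH:
            continue
        counts[insert[:TAG_LENGTH]] += 1
    return {tag: n for tag, n in counts.items() if n >= min_count}


insert = "CAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC"
reads = ["CTCC" + insert, "GTTGAA" + insert]
tags = count_tags(reads, barcodes=["CTCC", "GTTGAA"])
for tag, n in tags.items():
    print(tag, len(tag), n)
```

Raising `min_count` to 2 would drop singleton tags, which is one way to read the option to ignore singleton reads.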

TagCounts File
Number of tags: 26442466
Max size of tag (x 32 bp): 2
Columns: tag sequence, length (bp), count
CAGCAAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGAC 64 1
CAGCAAAAAAAAAAAAAAAAAAAACCAAGAATTTTATGTTTCCTACCTCCAACCCCAGGACTTT 64 1
CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCCTATACCTCATCCCACAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTTATTTCTCATACCTCATACCACAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTGATGTCTCAAACCCCAACACACAGGCTT 64 1
CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTGTCTCAAACCCCAACCCCCAGGCCT 64 1
CAGCAAAAAAAAAAAAAAAAAAAAGGGGTTTTGAATAAAAAAAACTGAAGGATCTTAAATCTAC 64 1
CAGCAAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTTTCATACCTCATACCACAGGACT 64 1
CAGCAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACT 64 2
CAGCAAAAAAAAAAAAAAAAAAACCAAAAAATTTTATGTCTCAAACCCCAAACCCCAGGGCTTC 64 1
CAGCAAAAAAAAAAAAAAAAAAACCAAATAATTTGATGTCTCATACCTCATACCACAGGGCTTC 64 1
CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTTC 64 1
CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTTGGCACTCAAGCCCAAAACCACAGATCTTC 64 1
CAGCAAAAAAAAAAAAAAAAAAACCAAGTAATTTGTTGTCTCATACCTCATACCACAGAACTTC 64 1
CAGCAAAAAAAAAAAAAAAAAAACCCAAAAAATTTTTTTTTCCAACCCCAAAACCCAAGGCTTC 64 1
CAGCAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTTTCCCAAACCCCAAACCCCAGGCTTT 64 1
CAGCAAAAAAAAAAAAAAAAAAAGGGATAGGGAAGATGGGGGAGAGTGGCGGCCACGCATGGAA 64 1
CAGCAAAAAAAAAAAAAAAAAACAACAAGGAATTTGGGTATTCATTCCCCATACCCCAGGATTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACAAAAAAATTTGTTTTCTCAACCCCAAAACCAAAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT 64 2
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACCAAGGAATTGAATCTCTCACACCTTAAAACACCGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACCAATTATTTGAAAGATCATTACCCTATACCACGGGGTTC 64 1
CAGCAAAAAAAAAAAAAAAAAACCAAAAAATTTGATGTCTCATACCCCATACCACAGGACTCCC 64 1
CAGCAAAAAAAAAAAAAAAAAACCAAAAAATTTTATTTCTCATACCCCAAACCCCAGGACTTCC 64 1
CAGCAAAAAAAAAAAAAAAAAACCAAAGAATTTTATGTCTCATACCTCAAACCAAAGGACTTCC 64 1
CAGCAAAAAAAAAAAAAAAAAACCAAATAAATTTGTTGCTCATACCCCAAACCACAGGGCTTTC 64 1
CAGCAAAAAAAAAAAAAAAAAACCAAGCAATTTGATTCCACTTAATCTATCCCACAGAACTTCC 64 1
CAGCAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCC 64 1
CAGCAAAAAAAAAAAAAAAAAACCCAAAAAATTTTTTGTTTCCCTAACCCCAAAACCACGGACT 64 1
CAGCAAAAAAAAAAAAAAAAAACCCAATGAATTTGTAGTGCCAAACCCCAAACCAACGGACTTT 64 1
CAGCAAAAAAAAAAAAAAAAAACCCCAAGAAATTTGATGTCTCATACCCCAAACCCCAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAAGACCAGGTAATTATTGCTCACATACATCAAACTCCAATTGCC 64 1
CAGCAAAAAAAAAAAAAAAAAAGCGCCTAACGTTTCAAAATGAATGAGTTGCCAACCAAGGACT 64 1
CAGCAAAAAAAAAAAAAAAAAAGGGTTAGGAAAGATGGGTGGGAGGGGCGGGCCTGCTTGAAAT 64 1
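Before alignment, the master tags are written out as FASTQ (the TagCountsToFASTQ step). A minimal sketch of that conversion follows, assuming the record layout visible in this deck's FASTQ listing: a header of the form `@length=<n>count=<n>`, the tag sequence, a `+` line, and a constant quality string of `f` characters. The function name is hypothetical.

```python
def tag_counts_to_fastq(tag_counts, quality_char="f"):
    """Render (tag, count) pairs as FASTQ text, one four-line record per tag."""
    records = []
    for tag, count in tag_counts:
        header = f"@length={len(tag)}count={count}"
        quality = quality_char * len(tag)  # placeholder quality, one char per base
        records.append("\n".join([header, tag, "+", quality]))
    return "\n".join(records) + "\n"


example = [("ACGT" * 16, 2)]  # a made-up 64-bp tag seen twice
fastq_text = tag_counts_to_fastq(example)
print(fastq_text, end="")
```

Encoding the length and count in the read name lets both values be recovered from the SAM output after alignment.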

Unique Reads (FASTQ)
@length=64count=1
CAGCAAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGAC
+
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
@length=64count=1
CAGCAAAAAAAAAAAAAAAAAAAACCAAGAATTTTATGTTTCCTACCTCCAACCCCAGGACTTT
+
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
@length=64count=1
CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCCTATACCTCATCCCACAGGACTT
+
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
@length=64count=1
CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTTATTTCTCATACCTCATACCACAGGACTT
+
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
@length=64count=1
CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTGATGTCTCAAACCCCAACACACAGGCTT
+
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
@length=64count=1
CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTGTCTCAAACCCCAACCCCCAGGCCT
+
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
@length=64count=1
CAGCAAAAAAAAAAAAAAAAAAAAGGGGTTTTGAATAAAAAAAACTGAAGGATCTTAAATCTAC
+
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
@length=64count=1
CAGCAAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTTTCATACCTCATACCACAGGACT
+
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
@length=64count=2
CAGCAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACT
+
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
@length=64count=1
CAGCAAAAAAAAAAAAAAAAAAACCAAAAAATTTTATGTCTCAAACCCCAAACCCCAGGGCTTC
+
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
@length=64count=1
CAGCAAAAAAAAAAAAAAAAAAACCAAATAATTTGATGTCTCATACCTCATACCACAGGGCTTC
+
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
@length=64count=1
CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTTC

BWA (Burrows-Wheeler Aligner)
Aligns the tags (in FASTQ format) to the reference genome.
Parameters:
Similarity of read sequence and genome sequence: This controls the tradeoff between number of SNPs and confidence in the alignment. Default is 4 edits per sequence.
Gap penalty: This controls sensitivity to indels. Default is no indels within 5 bp of the read ends.
Outputs a SAM alignment.
There are many other aligners. BWA is fast and memory efficient, but may not be appropriate for your species.
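As a rough illustration of how the two parameters above map onto a BWA run, the sketch below only builds the command lines; it does not execute them. It assumes the classic `bwa index`, `bwa aln` and `bwa samse` subcommands, with `-n` for the maximum number of edits and `-i` for the distance from the read ends within which indels are disallowed. File names are hypothetical, and the exact options the pipeline staff use are not stated in the slides.

```python
def bwa_commands(reference, fastq, sai="tags.sai", max_edits=4, indel_end_skip=5):
    """Build (index, aln, samse) argument lists for a single-end BWA alignment."""
    index = ["bwa", "index", reference]
    aln = [
        "bwa", "aln",
        "-n", str(max_edits),       # maximum edits allowed per tag
        "-i", str(indel_end_skip),  # no indels within this many bp of the ends
        "-f", sai,                  # write the intermediate .sai file here
        reference, fastq,
    ]
    # samse writes SAM to standard output; redirect it to a file when running.
    samse = ["bwa", "samse", reference, sai, fastq]
    return index, aln, samse


index_cmd, aln_cmd, samse_cmd = bwa_commands("reference.fa", "master_tags.fq")
for cmd in (index_cmd, aln_cmd, samse_cmd):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` once BWA is installed; raising `max_edits` recovers more divergent tags at the cost of less confident placements.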

Generic Alignment (SAM)
Columns: tag name, flag, chromosome, position, mapping quality, CIGAR, mate fields (* 0 0), sequence (cut off at the slide edge)
length=64count=1 0 7 6994125 37 55M2I7M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCT
length=64count=2 0 7 6994125 37 54M2I8M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCT
length=64count=1 0 7 6994125 37 53M2I9M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCT
length=64count=1 0 7 6994125 37 54M2I8M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCT
length=64count=1 0 7 6994125 37 55M2I7M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCT
length=64count=4 0 7 6994125 37 4M3D47M2I11M * 0 0 CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGA
length=64count=1 16 17 14761759 25 64M * 0 0 CCTTTCTTGGCCTGGTTCTCACTCATCTGGGCTT
length=64count=7 16 18 1517944 25 64M * 0 0 GCCCGTCTACACGCTTGTGTCCCATGCCCGCAAGCCGCCCCA
length=64count=1 16 18 1517944 25 64M * 0 0 GCCCGTCTACACGTTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count=1 16 18 1517944 25 64M * 0 0 GCCCGTCTACAGGCTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count=4 16 18 1517944 25 64M * 0 0 GCCCGTCTACCCGCTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count=2 16 18 1517944 25 64M * 0 0 GCCCGTCTCCACGCTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count=53 16 18 1517944 37 64M * 0 0 GCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count=1 16 18 1517944 25 64M * 0 0 CCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count=1 16 18 1517944 25 64M * 0 0 GCCCGTCTACACCCTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count=1 0 10 10388735 37 58M1I5M * 0 0 CAGCAAAAAAAAAAAATAGAACTTAGAAACTTAT
length=64count=1 0 2 714861 37 64M * 0 0 CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGAT
length=64count=11 16 19 13463035 37 49M1I14M * 0 0 TGCCCGTCTACACGCTTGTGTCCCAT
length=58count=1 0 2 14032437 37 4M1I59M * 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGA
length=64count=1 0 2 14032437 37 4M1I59M * 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGA
length=64count=1 16 19 13463036 37 48M2I14M * 0 0 GCCCGTCTACACGCTTGTGTCCCATG
length=64count=1 0 6 20542400 37 64M * 0 0 CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCC
length=64count=1 16 5 15019027 37 49M1I14M * 0 0 CCCATTGTTGTATCTTGATTGCAGAC
length=64count=3 16 5 15019027 37 49M1I14M * 0 0 ACCATTGTTGTATCTTGATTGCAGAC
length=64count=1 16 5 15019027 37 49M1I14M * 0 0 CCCATTGTTGTATCTTGATTGCAGAC
length=64count=1 16 5 15019027 37 49M1I14M * 0 0 ACCATTGTTGTATCTTGATTGCAGAC
length=64count=1 0 6 20542400 37 4M1I59M * 0 0 CAGCAAAAAAAAAACATCCTCTCCTCATACGCTC
length=64count=1 0 8 18851188 37 64M * 0 0 CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAA
length=64count=5 16 19 13463034 23 64M * 0 0 CTGCCCGTCTACACGCTTGTGTCCCATGCACGCA
length=64count=1 0 5 6176480 37 64M * 0 0 CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAG
length=64count=7 0 5 6176480 37 64M * 0 0 CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAG
length=57count=31 0 2 14032437 25 64M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAG
length=64count=4 0 2 14032437 25 64M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAG
length=64count=1 0 2 14032437 25 64M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAG
length=64count=1 16 5 15019027 37 16M1I47M * 0 0 TCCATTGTTGTATCTTCGATTGCAGA

SAMConverter & TagsOnPhysicalMap (TOPM)
TOPM is the key file for interpreting the tags present in a species. It contains:
Tag sequence
Position
Divergence from reference
Polymorphisms
Genetic mapping support
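To make the SAM listing easier to read, the sketch below pulls out the fields that matter for placing a tag on the physical map: strand (from flag bit 16), chromosome, start position, mapping quality, and a summary of the CIGAR string. It follows the standard SAM column order; the function name and the returned dictionary layout are hypothetical and are not the TOPM file format.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line):
    """Summarise one SAM alignment line as a small dictionary."""
    fields = line.split()
    name, flag, chrom, pos, mapq, cigar = fields[:6]
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    return {
        "tag": name,
        "strand": "-" if int(flag) & 16 else "+",  # bit 16: reverse strand
        "chrom": chrom,
        "start": int(pos),
        "mapq": int(mapq),
        # M, D, N, = and X consume reference bases.
        "ref_span": sum(n for n, op in ops if op in "MDN=X"),
        # Inserted plus deleted bases, a crude measure of indel divergence.
        "indel_bases": sum(n for n, op in ops if op in "ID"),
    }


record = parse_sam_line(
    "length=64count=4 0 7 6994125 37 4M3D47M2I11M * 0 0 CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGA"
)
print(record)
```

For this row the 64-bp tag spans 65 reference bases, because the 3-bp deletion adds to the span while the 2-bp insertion does not.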

TagsOnPhysicalMap File
6040401 2 4
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT 64 0 7 1 6994125 6994189 0
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT 64 0 7 1 6994125 6994189 0
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT 64 0 7 1 6994125 6994189 0
CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT 64 0 7 1 6994125 6994189 0
CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT 64 0 7 1 6994125 6994189 0
CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCCC 64 0 7 1 6994125 6994189 0
CAGCAAAAAAAAAAAACGGTTCTCAATTCCAAGCCCAGATGAGTGAGAACCAGGCCAAGAAAGG 64 0 17 0 14761759 1476182
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGGGCATGGGACACAAGCGTGTAGACGGGC 64 0 18 0 1517944 1518008 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAACGTGTAGACGGGC 64 0 18 0 1517944 1518008 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCCTGTAGACGGGC 64 0 18 0 1517944 1518008 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGGGTAGACGGGC 64 0 18 0 1517944 1518008 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGGAGACGGGC 64 0 18 0 1517944 1518008 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 64 0 18 0 1517944 1518008 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGG 64 0 18 0 1517944 1518008 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGGGTGTAGACGGGC 64 0 18 0 1517944 1518008 0
CAGCAAAAAAAAAAAATAGAACTTAGAAACTTATACCGTGGGACACGTCAAGTGACTGCTGATG 64 0 10 1 10388735 1038879
CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGATGGAAACAACAAACAAACAAAGA 64 0 2 1 714861 714925 0
CAGCAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCA 64 0 19 0 13463035 1346309
CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAA 64 0 2 1 14032437 1403250
CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGGGA 64 0 2 1 14032437 1403250
CAGCAAAAAAAAAAAGGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 64 0 19 0 13463036 1346310
CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATTT 64 0 6 1 20542400 2054246
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGGGAGTCTGCAATCAAGATACAACAATGGG 64 0 5 0 15019027 1501909
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGTGAGTCTGCAATCAAGATACAACAATGGT 64 0 5 0 15019027 1501909
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGGGGGTGAGTCTGCAATCAAGATACAACAATGGG 64 0 5 0 15019027 1501909
CAGCAAAAAAAAAAATGCAGAACGAGTGATGAGGCAGAGTCTGCAATCAAGATACAACAATGGT 64 0 5 0 15019027 1501909
CAGCAAAAAAAAAACATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATT 64 0 6 1 20542400 2054246
CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAATGAAGGCAAAAGTGCCCTTCTTAGCTGTAG 64 0 8 1 18851188 1885125
CAGCAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCAG 64 0 19 0 13463034 1346309
CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 64 0 5 1 6176480 6176544 0
CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 64 0 5 1 6176480 6176544 0

BWA sensitivity is pretty poor
Alignment Class       BWA   Bowtie2
Single Best Mapping   57%   69%
Multiple Mapping      17%   17%
Unmapped              26%   14%
BLAST performs about the same as Bowtie2. The code needs to be updated to parse Bowtie2 output. Many of the multiple-mapping tags do NOT map with 100% identity, which suggests they can be genetically mapped.

Tags by Taxa
6040401 2 88
Taxa: 08.0731-5 chardonnay 08.0731-19 08.0731-29 08.0731-6 08.0731-24 08.0731-37 08.0731-15 08
Columns: tag sequence, length (bp), then one count per taxon
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT 64 0 0 1 0 0 0 0
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT 64 0 1 0 0 0 0 0
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT 64 1 0 0 0 0 0 0
CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCCC 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAACGGTTCTCAATTCCAAGCCCAGATGAGTGAGAACCAGGCCAAGAAAGG 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGGGCATGGGACACAAGCGTGTAGACGGGC 64 0 1 0 0 0 0 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAACGTGTAGACGGGC 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCCTGTAGACGGGC 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGGGTAGACGGGC 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGGAGACGGGC 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 64 0 1 0 0 0 0 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGG 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGGGTGTAGACGGGC 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAATAGAACTTAGAAACTTATACCGTGGGACACGTCAAGTGACTGCTGATG 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGATGGAAACAACAAACAAACAAAGA 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCA 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAA 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGGGA 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAAGGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATTT 64 0 0 1 0 0 0 0
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGGGAGTCTGCAATCAAGATACAACAATGGG 64 0 1 0 0 0 0 0
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGTGAGTCTGCAATCAAGATACAACAATGGT 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGGGGGTGAGTCTGCAATCAAGATACAACAATGGG 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAATGCAGAACGAGTGATGAGGCAGAGTCTGCAATCAAGATACAACAATGGT 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAACATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATT 64 0 1 0 0 0 0 0
CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAATGAAGGCAAAAGTGCCCTTCTTAGCTGTAG 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCAG 64 0 0 0 0 0 0 0
CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 64 0 0 0 0 1 0 0
CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 64 0 1 0 0 1 0 0
CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAAA 64 1 1 0 0 1 0 0
CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGAGAT 64 1 0 0 0 0 0 0
CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGTTGCAGAGAA 64 0 0 0 0 1 0 0
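A Tags by Taxa table is essentially a tag-by-sample count matrix, and the pipeline merges one such table per lane into a single table per species. The sketch below shows one plausible way to do that merge with nested dictionaries; the function names and toy data are hypothetical, and summing counts when the same taxon appears in several lanes is an assumption rather than documented behaviour.

```python
from collections import defaultdict


def merge_tags_by_taxa(per_lane_tables):
    """Sum per-lane {tag: {taxon: count}} tables into one table."""
    merged = defaultdict(lambda: defaultdict(int))
    for table in per_lane_tables:
        for tag, counts in table.items():
            for taxon, n in counts.items():
                merged[tag][taxon] += n
    return {tag: dict(counts) for tag, counts in merged.items()}


def to_matrix(table, taxa):
    """Dense rows (tag, [count per taxon]) in a fixed taxon order."""
    return [(tag, [table[tag].get(t, 0) for t in taxa]) for tag in sorted(table)]


# Toy data: short made-up tags and taxon names.
lane1 = {"CAGCAAAT": {"taxonA": 1}, "CAGCAAGT": {"taxonB": 2}}
lane2 = {"CAGCAAAT": {"taxonA": 1, "taxonC": 1}}
merged = merge_tags_by_taxa([lane1, lane2])
for tag, row in to_matrix(merged, ["taxonA", "taxonB", "taxonC"]):
    print(tag, *row)
```

Most cells in a real table are zero, as in the listing above, so a sparse representation like the nested dictionaries keeps memory use manageable.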

Reference Genome Pipeline: TagsToSNPByAlignment

Tags that align to the same region are aligned against one another, and SNPs and small indels are identified.
Based on the alignments, SNPs are propagated into a HapMap file for the specific lines that carry each tag.

Parameters:
  Chromosomes to search for SNPs
  Bi- or tri-allelic SNPs
  Indels
  Genetic mapping support
  Max markers on a chromosome
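The idea of comparing tags that map to the same region and propagating the differences to the lines carrying each tag can be sketched as follows. This is only an illustration under strong assumptions: all tags start at the same position, have equal length, and contain no indels, so no real alignment is performed. The function name `call_snps` and the toy tags are hypothetical and are not taken from the plugin.

```python
def call_snps(tags, counts):
    """Find variable positions among co-located tags and call each taxon.

    tags   : list of equal-length sequences assumed to share a start position
    counts : counts[i][t] = reads of tags[i] observed in taxon t
    Returns {offset: [call per taxon]}, where a call is the observed base(s)
    joined by '/' or 'N' when the taxon has no reads covering the site.
    """
    n_taxa = len(counts[0])
    sites = {}
    for offset in range(len(tags[0])):
        bases = {tag[offset] for tag in tags}
        if len(bases) < 2:
            continue  # every tag agrees here, so the position is not a SNP
        calls = []
        for t in range(n_taxa):
            seen = sorted(
                {tag[offset] for tag, c in zip(tags, counts) if c[t] > 0}
            )
            calls.append("/".join(seen) if seen else "N")
        sites[offset] = calls
    return sites


# Two toy tags differing at offset 4, observed in three taxa.
toy_tags = ["CAGCAAT", "CAGCCAT"]
toy_counts = [[1, 0, 0],   # first tag seen once in taxon 0
              [0, 2, 0]]   # second tag seen twice in taxon 1
print(call_snps(toy_tags, toy_counts))
```

Taxon 2 has no reads for either tag, so it receives `N`, which is why low-coverage GBS data contain so many missing calls.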

HapMap Format

rs#        alleles  chrom  pos     strand  SgSBRIL067:633Y5AAXX:2:C9 SgSBRI
S1_2100    A/G      1      2100    +       N N N N N
S1_2163    T/C      1      2163    +       N N N N N
S1_13837   T/G      1      13837   +       N N N N
S1_14606   C/T      1      14606   +       N N C N
S1_20601   T/A      1      20601   +       T N N N
S1_68332   C/T      1      68332   +       N N N N
S1_68596   A/T      1      68596   +       A N N N
S1_69309   G/A      1      69309   +       N G N N
S1_79955   T/G      1      79955   +       N T G T
S1_79961   T/G      1      79961   +       N T T T
S1_80584   G        1      80584   +       N N N N
S1_80647   C/T      1      80647   +       N N N N
S1_81274   T/G      1      81274   +       N N N N
S1_108834  G/A      1      108834  +       N N N N
S1_112345  T/G      1      112345  +       N N N N
S1_115359  C/T      1      115359  +       N N N N
S1_115362  T/C      1      115362  +       N N N N
S1_115405  G/A      1      115405  +       G G A N
S1_115516  T/G      1      115516  +       N N T N
S1_116694  A/G      1      116694  +       N A G N
S1_119016  C/T      1      119016  +       N N N N
S1_155366  T/C      1      155366  +       N T N N

Why another pipeline?

The last maize build (21000 taxa) with the discovery pipeline took over 2 weeks.
Most common alleles have been identified after the first few discovery builds.
Use the information from the discovery pipeline to call SNPs in new runs quickly.
Improve efficiency and automate.

GBS bioinformatics pipeline

[Diagram, built up over several slides, with two branches labelled Discovery and Production. Labels: Tags by Taxa, Tag Counts, TOPM (TagsOnPhysicalMap), TOPM, SNP Caller, Genotypes, Filtered Genotypes, Genotypes.]

Running the Production Pipeline

Required files:
  Sequence file (fastq or qseq)
  Key file
  Production TOPM
  TASSEL 3 Standalone & RawReadsToHapMapPlugin
Running the pipeline:
  One lane processed at a time
  HapMap files by chromosome
  ~7 minutes

Testing the Production Pipeline

Compared HapMap files produced by the Discovery Pipeline and the Production Pipeline.
Site comparison: Discovery 48,139; Production 47,676
Difference due to maximum of 8 alleles
99.98% correlation of genetic distance matrices

Shifting to HDF5

Hierarchical Data Format (HDF5) supports very large data sets and complex data structures.
Widely used in the climate and astronomy communities.
TBT files can approach 2 TB in size.
Compressed HDF5 can be 40 times smaller.
Access times look very good.
Working to fuse TOPM, TBT, and key file into one HDF5 repository.

Why can GBS be complicated? Tools for filtering, error correction and imputation.
Edward Buckler
USDA-ARS, Cornell University
http://www.maizegenetics.net

Maize has more molecular diversity than humans and apes combined

[Figure: silent diversity, with values 1.34%, 0.09% and 1.42% (Zhao, PNAS 2000; Tenaillon et al., PNAS 2001).]

Only 50% of the maize genome is shared between two varieties

[Figure: genome sharing among Plant 1, Plant 2 and Plant 3 (maize, 50%) compared with Person 1, Person 2 and Person 3 (humans, 99%).]
Fu & Dooner 2002; Morgante et al. 2005; Brunner et al. 2005
Numerous PAVs and CNVs: Springer, Lai, Schnable in 2010

Maize genetic variation has been evolving for 5 million years

[Timeline figure from 5 mya to the present (Warm Pliocene, then Cold Pleistocene; ticks at 5, 4, 3, 2 and 1 mya). Maize events: Modern Variation Begins Evolving, Sister Genus Diverges, Zea species begin diverging, Maize domesticated. Human events: Divergence from Chimps, Ardipithecus, Australopithecus, Homo erectus, Modern Variation Begins, Modern Humans.]

What are our expectations with GBS?

High Diversity Ensures High Return on Sequencing

Proportion of informative markers:
  Highly repetitive: 15%, not easily informative
  Half the genome is not shared between two maize lines
    Potentially all of these are informative with a large enough database
  Low copy shared proportion (1% diversity), 64 bp tags
    Bi-parental information = 1 - (1 - 0.01)^64 ≈ 48% informative
    Association information = 1 - (1 - 0.05)^64 ≈ 96% informative

Expectation of marker distribution:
                        Biparental population   Across the species
  Presence / Absence    50%                     50%
  Too repetitive        15%                     15%
  Biallelic             17%                     -
  Multiallelic          -                       34%
  Nonpolymorphic        18%                     1%
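The arithmetic behind these expectations can be checked with a short Python sketch. It treats a 64 bp tag as informative when at least one of its bases differs, assumes bases vary independently at the stated per-base diversity, and takes the low copy shared proportion to be what remains after the 50% presence/absence and 15% repetitive fractions. The function name `informative_fraction` is hypothetical.

```python
TAG_LENGTH = 64  # bases per GBS tag


def informative_fraction(diversity, tag_length=TAG_LENGTH):
    """Probability that a tag contains at least one polymorphic base."""
    return 1.0 - (1.0 - diversity) ** tag_length


biparental = informative_fraction(0.01)   # two parents, ~1% diversity
association = informative_fraction(0.05)  # species-wide panel, ~5% diversity

# Low copy, shared part of the genome: not presence/absence, not repetitive.
low_copy_shared = 1.0 - 0.50 - 0.15

print(f"bi-parental informative:  {biparental:.1%}")
print(f"association informative:  {association:.1%}")
print(f"biallelic share (biparental):   {low_copy_shared * biparental:.0%}")
print(f"multiallelic share (species):   {low_copy_shared * association:.0%}")
```

Multiplying the informative fraction by the 35% low copy shared proportion gives the polymorphic slices of the two marker distributions.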

Sequencing Error

Illumina basic error rate is ~1%.
Error rates are associated with distance from the start of the sequence.
  Bad: GBS puts these all at the same position
  Good: Reverse reads can correct
  Good: Errors are consistent and modelable

Reads with errors

Perfect sequences: 0.99^64 = 52.5% of the 64 bp sequences are perfect; 47.5% are NOT perfect.
The errors are autocorrelated, so the proportion of perfect sequences is a little higher, and the proportion with 2 or more errors is also higher.

Do we see these errors?

Assume 10,000 lines genotyped at 0.5X coverage:

  Base  Type   Read # (no SNP)  Read # (w/ SNP)
  A     Major  4950             4900
  C     Minor  17               67 (50 real)
  G     Error  17               17
  T     Error  17               17
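The numbers on this slide follow from simple arithmetic, sketched here in Python. The sketch assumes independent errors at a flat 1% per base (the slide notes that real errors are autocorrelated) and that an erroneous base is equally likely to be any of the three other bases. The function names are hypothetical.

```python
TAG_LENGTH = 64
ERROR_RATE = 0.01


def perfect_read_fraction(error_rate=ERROR_RATE, tag_length=TAG_LENGTH):
    """Fraction of reads with no error at any base (independence assumed)."""
    return (1.0 - error_rate) ** tag_length


def expected_base_counts(n_lines=10_000, coverage=0.5,
                         error_rate=ERROR_RATE, real_minor_reads=0):
    """Expected reads per base at one site where A is the major allele."""
    total_reads = n_lines * coverage           # 5,000 reads over the site
    error_reads = total_reads * error_rate     # reads miscalled at this base
    per_wrong_base = error_reads / 3           # spread evenly over C, G, T
    return {
        "A": round(total_reads - error_reads - real_minor_reads),
        "C": round(real_minor_reads + per_wrong_base),
        "G": round(per_wrong_base),
        "T": round(per_wrong_base),
    }


print(f"perfect reads: {perfect_read_fraction():.1%}")
print("no SNP:  ", expected_base_counts())
print("with SNP:", expected_base_counts(real_minor_reads=50))
```

With only 50 real minor-allele reads, roughly a quarter of the C calls at a true SNP are errors, and a non-polymorphic site still shows 17 apparent minor reads.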

Do Errors Matter?

  Yes: Imputation, haplotype reconstruction
  Maybe: GWAS for low frequency SNPs
  No: GS, genetic distance, mapping on biparental populations

Expectations of Real SNPs

Vast majority are biallelic
Homozygosity is predicted by inbreeding coefficient
Allele frequency is constrained in structured populations
In linkage disequilibrium with neighboring SNPs

Clean Up and Imputation

[Workflow diagram. Input: HapMap. Steps: MergeDuplicateSNPsPlugin (merge reads from opposite sides); GBSHapMapFiltersPlugin (site coverage, taxa coverage, inbreeding coefficient, LD); BiParentalErrorCorrectionPlugin (error rate estimation, LD filters); Imputation; MergeIdenticalTaxaPlugin (error rate estimation, LD filters). Output: HapMap, used for GWAS, imputation & phasing, kinship, distance, phylogeny, LD, GS. Status: inbreds partially solved; heterozygous not solved yet. Legend: Process, File (data structure).]

Filters in TagsToSNPByAlignmentMTPlugin

Only calls bi-allelic SNPs (hard coded for now)
  The two most common alleles are used
Inbreeding coefficient (-mnf)
  If you have inbred samples, definitely use it; very powerful against errors and paralogues
Minimum minor allele frequency (-mnmaf)
  Very important if you do not have other tools for filtering (bi-parental populations or LD)
  Set to >= 1% if no other filter method is present

MergeDuplicateSNPsPlugin

When restriction sites are less than 128 bp apart, we may read a SNP from both directions (strands)
  ~13% of all sites
Fusing increases coverage
Fixes errors
  -mismat = set maximum mismatch rate
  -callhets = mismatch set to hets or not

GBSHapMapFiltersPlugin

Basic filters for coverage of sites and taxa, inbreeding coefficient, and LD
  -mntcov = minimum taxa coverage (e.g. 0.05)
  -mnscov = minimum site coverage, proportion of taxa with call (e.g. 0.10)
  -mnmaf = minimum minor allele frequency (e.g. 0.01)
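What the coverage and minor allele frequency thresholds measure can be illustrated with a small Python sketch. It is not the plugin's code: it assumes single-letter homozygous calls with `N` for missing, treats each call as one allele observation, and uses the second most common allele as the minor allele. The function names and the three example rows (taken from the HapMap slide) are for illustration only.

```python
from collections import Counter

MISSING = "N"


def site_coverage(calls):
    """Proportion of taxa with a call at this site (compare -mnscov)."""
    return sum(c != MISSING for c in calls) / len(calls)


def taxa_coverage(sites, taxon_index):
    """Proportion of sites with a call for one taxon (compare -mntcov)."""
    return sum(site[taxon_index] != MISSING for site in sites) / len(sites)


def minor_allele_frequency(calls):
    """Frequency of the second most common allele (compare -mnmaf)."""
    counts = Counter(c for c in calls if c != MISSING)
    if len(counts) < 2:
        return 0.0  # monomorphic or entirely missing
    ordered = counts.most_common()
    return ordered[1][1] / sum(counts.values())


def filter_sites(sites, min_site_cov=0.10, min_maf=0.01):
    """Keep sites that pass both thresholds."""
    return [
        s for s in sites
        if site_coverage(s) >= min_site_cov
        and minor_allele_frequency(s) >= min_maf
    ]


# Genotype columns of S1_79955, S1_80647 and S1_115405 from the HapMap slide.
example_sites = [list("NTGT"), list("NNNN"), list("GGAN")]
kept = filter_sites(example_sites, min_site_cov=0.5, min_maf=0.01)
print(len(kept), "of", len(example_sites), "sites kept")
```

The all-missing site fails both thresholds, while the two segregating sites have 75% coverage and a minor allele frequency of one third.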

GBSHapMapFiltersPlugin (continued)

  -mnf = minimum inbreeding coefficient (e.g. 0.9)
    Don't use with outcrossers
  -hld = require that sites are in high local LD. The parameters are currently hard coded, so they are difficult to tune without editing the code.
    Tests a sliding window of 100 surrounding sites, and looks for a Bonferroni-corrected P < 0.01
    Useful, but can be a slow option. More work needed here.

Biparental populations

Limited range of alleles, expected allele frequencies, high LD
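To show why an inbreeding coefficient threshold such as -mnf removes errors and paralogues in inbred material, here is a sketch of the standard estimate F = 1 - H_obs / H_exp. Whether the plugin computes F in exactly this way is an assumption; the two-letter genotype encoding and the function name are hypothetical.

```python
from collections import Counter


def inbreeding_coefficient(genotypes):
    """Estimate F = 1 - observed / expected heterozygosity at one site.

    genotypes: diploid calls such as "AA", "AG", "GG"; any call that
    contains "N" is treated as missing. Returns None when the site is
    monomorphic or has no calls.
    """
    called = [g for g in genotypes if "N" not in g]
    alleles = Counter(a for g in called for a in g)
    if not called or len(alleles) < 2:
        return None
    total = sum(alleles.values())
    expected_het = 1.0 - sum((n / total) ** 2 for n in alleles.values())
    observed_het = sum(g[0] != g[1] for g in called) / len(called)
    return 1.0 - observed_het / expected_het


inbred_site = ["AA"] * 5 + ["GG"] * 5          # clean site in inbred lines
random_mating_site = ["AA", "AG", "AG", "GG"]  # Hardy-Weinberg proportions
paralog_like_site = ["AG"] * 4                 # every line looks heterozygous

for name, site in [("inbred", inbred_site),
                   ("random mating", random_mating_site),
                   ("paralog-like", paralog_like_site)]:
    print(name, inbreeding_coefficient(site))
```

A site where tags from two paralogous loci collapse together looks heterozygous in every line, so its F is strongly negative and a threshold near 0.9 removes it, while such a threshold would also remove genuine sites in outcrossing species.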

Maize RIL population expectations

Allele frequency 0% or 50%
Nearby sites should be in very high LD (r^2 > 50%)
Most sites can be tested if multiple populations are available

Bi-parental populations allow identification of error, and non-mendelian segregation

[Figure labels: Non-segregating, Error, Segregating.]

Bi-parental populations allow identification of error, and non-mendelian segregation

Median error rate is 0.004, but there is a long tail of some high-error sites.

[Figure labels: Error, Median.]

BiParentalErrorCorrectionPlugin

  -popm = REGEX population identification (e.g. Z[0-9]{3})
  -popf = population file (not implemented), instead of the -popm option
  -mxe = maximum error rate (e.g. 0.01); calculated from non-segregating populations

BiParentalErrorCorrectionPlugin (continued)

  -mnd = distortion from expectation (e.g. 2.0); the test uses both the binomial distribution and this distortion to classify segregation.
  -mnpld = minimum linkage disequilibrium, r^2 = 0.5; this is calculated within each population, and then the median across segregating populations is used
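The combination of a population regex and a per-population r^2 summarised by its median can be sketched in Python as below. This is a sketch under assumptions, not the plugin's implementation: calls are single-letter homozygous genotypes with `N` for missing, sites are biallelic, a population counts as segregating only if both sites vary within it, and the taxon names are invented to match the example pattern Z[0-9]{3}.

```python
import re
from statistics import median


def population_of(taxon, pattern=r"Z[0-9]{3}"):
    """Extract the population label from a taxon name, or None."""
    match = re.search(pattern, taxon)
    return match.group(0) if match else None


def r_squared(site_a, site_b):
    """r^2 between two biallelic sites over taxa called at both.

    Returns None if either site does not segregate among those taxa.
    """
    pairs = [(a, b) for a, b in zip(site_a, site_b) if a != "N" and b != "N"]
    if not pairs:
        return None
    ref_a, ref_b = pairs[0]
    x = [1.0 if a == ref_a else 0.0 for a, _ in pairs]
    y = [1.0 if b == ref_b else 0.0 for _, b in pairs]
    n = len(pairs)
    px, py = sum(x) / n, sum(y) / n
    if px in (0.0, 1.0) or py in (0.0, 1.0):
        return None
    pxy = sum(xi * yi for xi, yi in zip(x, y)) / n
    d = pxy - px * py
    return d * d / (px * (1 - px) * py * (1 - py))


def median_population_r2(taxa, site_a, site_b, pattern=r"Z[0-9]{3}"):
    """Median of within-population r^2 across segregating populations."""
    by_pop = {}
    for taxon, a, b in zip(taxa, site_a, site_b):
        pop = population_of(taxon, pattern)
        if pop is not None:
            by_pop.setdefault(pop, []).append((a, b))
    values = []
    for calls in by_pop.values():
        r2 = r_squared([a for a, _ in calls], [b for _, b in calls])
        if r2 is not None:
            values.append(r2)
    return median(values) if values else None


# Hypothetical taxa from two populations, Z001 and Z002.
taxa = ["Z001E0001", "Z001E0002", "Z001E0003", "Z001E0004",
        "Z002E0001", "Z002E0002", "Z002E0003", "Z002E0004"]
site_a = list("AAGGAAGG")
site_b = list("CCTTCTCT")
print(median_population_r2(taxa, site_a, site_b))
```

Computing r^2 inside each family first avoids the inflated LD that population structure would produce if all taxa were pooled.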

MergeIdenticalTaxaPlugin

Fuse taxa with the same name.
Useful for checks and duplicated runs.
Also useful in determining error rates.
  -xhets = exclude heterozygote calls (e.g. true)
  -hetfreq = frequency between hets and homozygous calls (e.g. 0.76)

Product of Filtering

After filters, in maize we find a 0.0018 error rate:
  AA<>aa = < 0.0018
  AA<>Aa = 0.8 at low coverage
SNPs in wrong location: < ~1%. Lower in other species.
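How duplicated runs of the same taxon can yield an error estimate and a fused call set is sketched below. The rules here are assumptions chosen for illustration (single-letter calls, `N` for missing, conflicting calls set to missing) and do not reproduce the plugin's handling of heterozygotes or its -hetfreq option. Both function names are hypothetical.

```python
MISSING = "N"


def replicate_discordance(calls_a, calls_b):
    """Fraction of sites called in both replicates where the calls differ."""
    both = [(a, b) for a, b in zip(calls_a, calls_b)
            if a != MISSING and b != MISSING]
    if not both:
        return None
    return sum(a != b for a, b in both) / len(both)


def merge_replicates(calls_a, calls_b):
    """Fuse two runs of one taxon, filling gaps and dropping conflicts."""
    merged = []
    for a, b in zip(calls_a, calls_b):
        if a == MISSING:
            merged.append(b)
        elif b == MISSING or a == b:
            merged.append(a)
        else:
            merged.append(MISSING)  # conflicting calls: left missing here
    return merged


run1 = list("ANGTC")
run2 = list("ACGNT")
print(replicate_discordance(run1, run2))
print("".join(merge_replicates(run1, run2)))
```

Fusing raises coverage because each run fills some of the other's missing calls, and the discordance over jointly called sites gives a direct empirical error rate.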

Clean Up and Imputation

[Workflow diagram as before, with updated status: inbreds partially solved; heterozygous partially solved.]

Missing Data

Two major sources:
  Sampling
    Low coverage often used in big genomes with inbred lines
    Differential coverage caused by fragment size biases
  Biological
    Region on genome not shared between lines
    Cut site polymorphisms
We want to impute the missing sampling but not the biological.

Standard Imputation

Lots of algorithms: FastPhase, NPUTE, BEAGLE, etc.
These are appropriate for high coverage loci, inbreds, and regions where biological missing is a rare condition.
Some can be slow for the sample sizes that we have.

FastImputationBitFixedWindow

Imputation approach focused on speed and large sets of taxa with some closely related individuals.
Nearest neighbor approach, fixed window sizes
Strengths: Very accurate (<1% error), much faster than other algorithms (100X)
Weakness: Not good at recombination junctions or with heterozygosity
Code is in TASSEL; not a plugin, but available
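A nearest-neighbor, fixed-window imputation of the kind described above can be sketched in a few lines of Python. This illustrates the general idea only and is not the TASSEL code: the window size, the minimum overlap, the mismatch score and the single-letter inbred calls with `N` for missing are all assumptions, and the function name is hypothetical.

```python
MISSING = "N"


def impute_fixed_window(genotypes, window=4, min_overlap=2):
    """Fill missing calls from the most similar taxon within each window.

    genotypes: dict mapping taxon name -> list of single-letter calls.
    Returns a new dict; calls stay missing when no donor qualifies.
    """
    taxa = list(genotypes)
    n_sites = len(genotypes[taxa[0]])
    imputed = {t: list(genotypes[t]) for t in taxa}
    for start in range(0, n_sites, window):
        end = min(start + window, n_sites)
        for target in taxa:
            target_calls = genotypes[target][start:end]
            if MISSING not in target_calls:
                continue
            best, best_score = None, None
            for donor in taxa:
                if donor == target:
                    continue
                donor_calls = genotypes[donor][start:end]
                shared = [(a, b) for a, b in zip(target_calls, donor_calls)
                          if a != MISSING and b != MISSING]
                if len(shared) < min_overlap:
                    continue  # too little evidence to compare
                mismatch = sum(a != b for a, b in shared) / len(shared)
                if best_score is None or mismatch < best_score:
                    best, best_score = donor, mismatch
            if best is None:
                continue
            for i in range(start, end):
                if imputed[target][i] == MISSING:
                    imputed[target][i] = genotypes[best][i]
    return imputed


toy = {"t1": list("ACNT"), "t2": list("ACGT"), "t3": list("GTAA")}
print("".join(impute_fixed_window(toy)["t1"]))
```

Because the donor is chosen once per window, a recombination breakpoint inside a window cannot be represented, which matches the weakness at recombination junctions noted above.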

Hidden Markov Model: TASSEL GBS Imputation

Developed by Peter Bradbury
Aimed at GBS and biparental populations
Hidden Markov Model
Very accurate at determining boundaries
Works well on maize NAM inbred lines, and probably others
  AA <> BB error rate: 0.00005
  AB > AA: 0.0278
Most problems appear in faulty populations
Available as a TASSEL 4.0 plugin
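Why a hidden Markov model is good at locating boundaries in biparental data can be seen from a minimal two-state Viterbi sketch. It is not the TASSEL plugin: it assumes fully inbred lines with only two hidden states (parent A or parent B), calls already recoded by parent of origin, a fixed genotyping error rate, and a fixed switch probability between adjacent sites. All names and parameter values are hypothetical.

```python
import math


def viterbi_parent_states(obs, error=0.01, switch=0.01):
    """Most likely parent-of-origin path for one line along a chromosome.

    obs: string of 'A', 'B' (call matches that parent) or 'N' (missing).
    """
    states = ("A", "B")

    def emit(state, call):
        if call == "N":
            return 0.0  # a missing call carries no information
        return math.log(1 - error) if call == state else math.log(error)

    stay, move = math.log(1 - switch), math.log(switch)
    score = {s: math.log(0.5) + emit(s, obs[0]) for s in states}
    back = []
    for call in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda p: score[p] + (stay if p == s else move))
            new_score[s] = (score[prev]
                            + (stay if prev == s else move)
                            + emit(s, call))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


observed = "AAAANAABAAAABBBNBBB"
print(observed)
print(viterbi_parent_states(observed))
```

The isolated `B` inside the run of `A` calls is explained as one genotyping error because that is cheaper than two switches, the missing calls are filled from their flanks, and the real transition is placed where the calls change for good.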

Mapping all the alleles (TagCallerAgainstAnchor)

Most maize alleles have no position on the reference map.
Map allele presence (TagsByTaxa) versus an anchor SNP map (HapMap).
8.7M alleles were mapped in <24 hours using a 100 CPU cluster.

Alleles: physical and genetic mapping of 8.7 million GBS alleles
[Chart legend: Genetic and Physical Agree; Genetic and Physical Disagree; Not in Physical, Genetically mapped; Complex mapping or modest power currently; Consistent Error or Evenly repetitive.]
Only 29% of alleles are simple: physical and genetic agree
55% of alleles are easily genetically mappable

Reads
[Chart legend: Reads with strong genetic and/or BLAST position; Reads with weaker position hypothesis; Reads with no hypothesis (error or even repetitive).]
Many complex alleles are rarer, so 71% of alleles are genetically and/or physically interpretable.
With more samples and better error models, perhaps 90% will be usable.

Using the Presence/Absence Variants

In species like maize, this is the majority of the data
Less subject to sequencing error
Need imputation methods to differentiate between missing from sampling and biologically missing

Future

Need better integration of whole genome sequence data with the pipeline
Add information on premature cut sites or mutated cut sites
Use paired-end read information
Full incorporation of presence/absence variants
Increase range of imputation tools and phasing for structured populations
Quantitative genotype tools for polyploids / GS