GBS Bioinformatics Pipeline


1 GBS Bioinformatics Pipeline ...or, Where Your Data Go After Sequencing
James Harriman, Ed Buckler, Jeff Glaubitz

Reference Genome Pipeline
[Flowchart. Processes: QseqToTagCount, QseqToTBT, Merge TagsCounts, Merge TagsByTaxa, TagCountsToFASTQ, BWA (Burrows-Wheeler Aligner), SAM convertor, TagsToSNPByAlignment. Files (data structures): Qseq, Key files, TagCounts per lane, TagCounts for species (Master Tags), TagsByTaxa files (1 per lane), TagsByTaxa for species, SAM alignment, TagsOnPhysicalMap, HapMap.]

2 Non-Reference Genome Pipeline
[Flowchart. Processes: QseqToTagCount, QseqToTBT, Merge TagsCounts, Merge TagsByTaxa, TagHomology PhaseNoAnchor. Files (data structures): Qseq, Key files, TagCounts per lane, TagCounts for species (Master Tags), TagsByTaxa files (1 per lane), TagsByTaxa for species, HapMap.]

Raw Sequence (Qseq)
HWI-ST GTCGATTCTGCTGACTTCATGGCTTCTGTTGACGACGATGTGGAACGAGCTGTTGTTGAAACTGATGAGGTTGC
HWI-ST GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGTATGCGATGACAGTTACTCTTACTGTCCATTGTCAGCATTGC
HWI-ST ATGTACTGCACCGTTGCAAGCGAGCACCACCAAGCGGCGGTATGCACTTTGCAATATGTAGCTAGAATAGGATT
HWI-ST CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAATGCCTCTCTTGGCCTAGCATTTTGGGCATACCCTGTGACCA
HWI-ST GATTTTACTGCACATCGGTCTTGTCACACCAGCTATACCTGTAGAGTTGCCTTCCACAGTTGTAGAGATCGGAAG
HWI-ST TCACCCAGCATCACGCCCCTTCACATCCAGTAAAACCCCTGAATGATGTGCTGTCACTGTTTGATATACAGTTGT
HWI-ST CTTGACTGCCACCATGAATATGTGTTCCAAGTGCCACAAGGACTTGGCCCTGAAGCAAGAACAAGCCAAACTTG
HWI-ST CCACAACTGCTCCATCTTTTCCATGAGACATTGCTCCCGCCATTGCACCCTTGGCATCAGCAGAGATCGGAAGA
HWI-ST GTATTCTGCACACGAATCAGCTGAGACACCAATTGGGCATGAATCAAATGGCGCCATTGCCGGGGATCGAACCC
HWI-ST AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAGGGCTCATATTCAGTCACCTATATCAATTTCGAAATGGATTTC
HWI-ST CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTTGGAGCGTCTATCGGCGTTGCTGAGATCGGAAGAGCGGTT
HWI-ST TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTTAGTGGTTCGCAGAGCATTTGGCAGCTGAGATGGGAAGAGC
HWI-ST GGACCTACTGCCCAAGAACGGCTCACCCATCATCCGCTTTCTTCACCTTCCGTCTTCTTTGGCTGAGATCGGAA
HWI-ST GAGAATCAGCGTGTACGGGGCACGGGGTGACTGCTGTTGCGTGCGAGGGCTGAGATCGGAAGAGCGGTTCA
HWI-ST TTCTCCAGCCGCATGGGCCGGAGACCAGAGAGGCCTCCCCAGGATTTGCACGATAGACCACGACTTATGGACG
HWI-ST GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCAATTGCCTCAGCAACTTGGGCCACAAACACCACAGCTGAGA
HWI-ST TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAAAAGAGGGCCCCTCACTTCTCTCAAGTGCTGAGATCGGAAG
HWI-ST GGACCTACTGCCGGCGGGACGAAAGCGGTTGTTGAATGATGGGGGTCACTAGGCCTTCCAGGGCCTTTAAGC
HWI-ST CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTCTCGGCCTTCTTCAAGCCATTCTCTTGGCAGACGGCTTTGC
HWI-ST CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTGGTGCCCCTACCTCGGACAAGACAGATGCAGAGATCGGAA
HWI-ST CCAGCTCAGCATGGATCTCTCCTTGATGGACTGAAAGCGCGTGTGCTCCCCTGTGTGATGGAAAGTGGCAGTG
HWI-ST CCAGCTCAGCTCAAGCATTGGCTTCCGCTTTGGCATCCTGGAGGGTAAGCTTCTGCTCTTCTCACTAGAGGAG
HWI-ST ACAAACAGCAGAGGTCGCATTGTAGTTAGTCCGGGACTTGCCCAGTTCATTGCTGAGATCGGAAGAGCGGTTC
HWI-ST GCTCTACAGCTTCTGGCCAGAATGCTTTTGGCACTTGTTTGTCACAAAGCATGCACTGAACCATATTCATGATAG
HWI-ST TTCTCCAGCTGCTACATGCACCGTGGGAAGAAGGTCTGCCCCACATACCCACCAGCCATCGCCCTTCTCACAT
HWI-ST GAGATACAGCTGCGAATTGGGGGTTCCTGTGTTGCGAAGTGGCACTCGTGTGCCAAACTTGGCTACGCAGAGA
HWI-ST AAAAGTTCAGCAATACCTGTTGAAGCCAAGCCCTTGTGGTGATTGCCTCGTTCATTGCTGCTGAGATCGGAAGA
HWI-ST GAATCTGCTACTAGTGAGCCTTTGTATGGGGACCGAGTTCAGAAGCTCTAACCCTCGTTTTCCCATCTGCTGAG
HWI-ST TAGCATGCCTGCTGCAGGAGTTGGTGCCCAGCATTCTCAGGTGTAGTCCAAATTCTGTCTGATACTTATTGTTTA
HWI-ST TTCAGACAGATGATGCTTGTCAAGGGTCACCATCTTGCATTGCGCTGCGTCACATCCTTAGTGGGAATAGGGGA
HWI-ST CTTGCTTCAGCCATGTAGAGTGGTGTTGCTCCTTTACTACCACGAATCATTGGTAACTCCCTGTTCTTATTCACC
HWI-ST TTCAGACAGCCAAACGACGTCTTAGTGGAGAAAATACCTGAGAAAAGTCAAGAAACCAAAACACTAAAAAATGA
HWI-ST AGCCTCAGCTTGGTTGCTTGTGGTTGGGGGTGAGGGGGCGGGCGGGAACTTATGTTTGCGCCCCGAGGCGG
HWI-ST CTTGACTGGGCGTGGTGCTGAGGCTACTGCGGAATTGAGGTGTTGTCATCCACCGGATTGGGTCGTAGGGCG
HWI-ST TTCAGACAGCCAACTGAGATGACTCTCATTCTTGGTAGGAACCAATTTCTGAGAGCTTCGTAATGACATCAACTA
HWI-ST GAGATACAGCAACAAATGATGTCATTCCTTGCAAAAGCTGTACAAAGCCCTGGTTTCTTAGCTCAGCTGGTACAG
HWI-ST GTGTTTGGTCGTGAAAGTGGACCTCTTTCAGGTGCAGGTGCGAGTAGAAGGAGGTCCCAGAGACGTGCGGCT
HWI-ST GAGAAACCGCAGAATGATAGCAAAAAGCGCGTTACAGGAGATATTAAGAAAAGGAGACTTGCAATGCAGGAGTA
HWI-ST CGTCAACTGCATGAAGGAGGTTGTCTGGCCGTTGGAGGAGTGATTTTGGAAGGCTGAGATCGGAAGAAAGGT
HW

3 Assignment to Samples
Barcode sequences from the plate map are compared with the barcode sequences in the reads, in order to associate each read with the sample from which it originates.
Parameters: Users supply a plate map and staff members supply DNA barcodes. These are combined into a table of barcodes by sample.

Plate Map
Column groups: Project Details, Sample Details, Organism Detail
Columns: Project Name, Source Lab, Plate Name, Well, Sample Name, Pedigree, Population, Stock Number, Sample
BREAD Buckler BREAD-Maize-A A01 PI inbred 04A0160A Wenyan Zhu
BREAD Buckler BREAD-Maize-A B01 blank plantae
BREAD Buckler BREAD-Maize-A C01 PI inbred 04A0191B Wenyan Zhu
BREAD Buckler BREAD-Maize-A D01 PI inbred 04A0165A Wenyan Zhu
BREAD Buckler BREAD-Maize-A E01 PI inbred 04A0193B Wenyan Zhu
BREAD Buckler BREAD-Maize-A F01 CML91 inbred 04A0005BA Wenyan Zhu plantae
BREAD Buckler BREAD-Maize-A G01 CML311 inbred 04A0301A Wenyan Zhu plantae
BREAD Buckler BREAD-Maize-A H01 CML311 inbred 04A0200A Wenyan Zhu plantae
BREAD Buckler BREAD-Maize-A A02 MR_ (PI x PI655998)S4 PI x PI A0281A 10
BREAD Buckler BREAD-Maize-A B02 MR_ (PI x PI655998)S4 PI x PI A0279B 10
BREAD Buckler BREAD-Maize-A C02 MR_ (PI x PI655998)S4 PI x PI A0164B 10
BREAD Buckler BREAD-Maize-A D02 MR_ (PI x PI655998)S4 PI x PI A0163A 10
BREAD Buckler BREAD-Maize-A E02 MR_ (PI x PI655998)S4 PI x PI A0315B 10
BREAD Buckler BREAD-Maize-A F02 MR_ (PI x PI655998)S4 PI x PI F146114A 10
BREAD Buckler BREAD-Maize-A G02 MR_ (PI x PI655998)S4 PI x PI A0289B 10
BREAD Buckler BREAD-Maize-A H02 MR_ (PI x PI655998)S4 PI x PI A0171A 10
BREAD Buckler BREAD-Maize-A A03 MR_ (PI x PI655998)S4 PI x PI A0170B 10
BREAD Buckler BREAD-Maize-A B03 MR_ (PI x PI655998)S4 PI x PI A0381B 10
BREAD Buckler BREAD-Maize-A C03 MR_ (PI x PI655998)S4 PI x PI A0258A 10
BREAD Buckler BREAD-Maize-A D03 MR_ (PI x PI655998)S4 PI x PI A0304B 10
Cacao Buckler BREAD-Maize-A E03 Tc1536 Catie F1 04A0216A Jemmy Takrama
Cacao Buckler BREAD-Maize-A F03 Tc7959 Brazil F2 04A0255A Jemmy Takrama
BREAD Buckler BREAD-Maize-A G03 PI inbred 04A0217A Wenyan Zhu
BREAD Buckler BREAD-Maize-A H03 PI inbred 04A0167A Wenyan Zhu
BREAD Buckler BREAD-Maize-A A04 PI inbred 04P160451A Wenyan Zhu
BREAD Buckler BREAD-Maize-A B04 PI17548 inbred 04A0258B Wenyan Zhu plantae
BREAD Buckler BREAD-Maize-A C04 PI inbred 04A0244A Wenyan Zhu
BREAD Buckler BREAD-Maize-A D04 PI inbred 04A0298A Wenyan Zhu
BREAD Buckler BREAD-Maize-A E04 PI inbred 04A0293B Wenyan Zhu
BREAD Buckler BREAD-Maize-A F04 PI inbred 04A0296A Wenyan Zhu

4 Example DNA Barcode Key
Columns: Flowcell, Lane, barcode, sample, Plate#, Row, Column, PlateName
434GFAAXX 2 CTCC M A 1 IBM1 1A01
434GFAAXX 2 TGCA M A 2 IBM1 1A02
434GFAAXX 2 ACTA M A 3 IBM1 1A03
434GFAAXX 2 GTCT M A 4 IBM1 1A04
434GFAAXX 2 GAAT M A 5 IBM1 1A05
434GFAAXX 2 GCGT M A 6 IBM1 1A06
434GFAAXX 2 TGGC M A 7 IBM1 1A07
434GFAAXX 2 CGAT M A 8 IBM1 1A08
434GFAAXX 2 CTTGA M A 9 IBM1 1A09
434GFAAXX 2 TCACC M A 10 IBM1 1A10
434GFAAXX 2 CTAGC M A 11 IBM1 1A11
434GFAAXX 2 ACAAA M A 12 IBM1 1A12
434GFAAXX 2 TTCTC M B 1 IBM1 1B01
434GFAAXX 2 AGCCC M B 2 IBM1 1B02
434GFAAXX 2 GTATT M B 3 IBM1 1B03
434GFAAXX 2 CTGTA M B 4 IBM1 1B04
434GFAAXX 2 AGCAT M B 5 IBM1 1B05
434GFAAXX 2 ACTAT M B 6 IBM1 1B06
434GFAAXX 2 GAGAAT M B 7 IBM1 1B07
434GFAAXX 2 CCAGCT M B 8 IBM1 1B08
434GFAAXX 2 TTCAGA M B 9 IBM1 1B09
434GFAAXX 2 TAGGAA unknown 1 B 10 IBM1 1B10

Notes on Names & Chromosomes
Chromosomes (or contigs) MUST be integers.
Sample names, some advice:
  NO spaces
  NO ":"
  Try to avoid weird characters.
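The barcode lookup described on slides 3 and 4 can be sketched in a few lines. This is a minimal illustration, not the pipeline's own code: the barcodes are copied from the example key, but the sample labels are placeholders (the sample IDs are not fully legible in this transcript), and the cut-site remnants CAGC/CTGC are an assumption based on how the example reads and tags begin.

```python
# Barcodes taken from the example key; sample labels are hypothetical.
BARCODE_TO_SAMPLE = {
    "CTCC": "sample_A01",
    "TGCA": "sample_A02",
    "CTTGA": "sample_A09",
    "GAGAAT": "sample_B07",
}
# Assumed cut-site remnant that must follow the barcode.
CUT_SITE_REMNANTS = ("CAGC", "CTGC")


def assign_read(read, barcode_to_sample=BARCODE_TO_SAMPLE):
    """Return (sample, insert) for a read, or None if nothing matches."""
    # Try longer barcodes first so a short barcode cannot shadow a long one.
    for barcode in sorted(barcode_to_sample, key=len, reverse=True):
        if read.startswith(barcode):
            insert = read[len(barcode):]
            # Require the cut-site remnant right after the barcode.
            if insert.startswith(CUT_SITE_REMNANTS):
                return barcode_to_sample[barcode], insert
    return None


print(assign_read("GAGAATCAGCTTTTCCAACACC"))
```

Requiring the cut-site remnant after the barcode is what keeps a read that merely happens to start with barcode-like bases from being assigned to a sample.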

5 QSeqToTagCounts
Processes a Qseq file so we know which alleles (tags) are present in the sample.
  Handles sequence quality issues
  Identifies the barcodes
  Removes problem tags
  Counts tags

6 GBS Restriction Fragment Structure
Fragment: Barcode adapter | Cut site | Read | Cut site | Common adapter
Accepted read: Barcode adapter | Cut site | Read
Rejected or trimmed reads:
  Potential chimeric sequence: Barcode adapter | Cut site | Read | Cut site | Sequence
  Short sequence: Cut site | Read | Cut site | Common adapter
  Adapter dimer: Barcode adapter | Cut site | Common adapter

Sequence Processing
Raw sequence data is processed into unique 64-bp sequences. For example:
CTCCCAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC
GTTGAACAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC
Becomes:
CAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC 64 2
Parameters:
  Restriction enzyme: Different enzymes create different sequence motifs, such as overlapping cut sites, palindromes or wobble bases.
  Barcode: Barcode sequences must be provided to identify acceptable reads.
  Number of identical sequences accepted: This gives investigators the option to ignore repetitive sequences or singleton reads.
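The collapse of reads into unique 64-bp tags can be sketched as follows. This is a simplified illustration under stated assumptions: inputs are reads with the barcode already removed, and reads that are too short or contain N are simply dropped here, which may differ from how the actual plugin trims or pads them. The `min_count` argument stands in for the "number of identical sequences accepted" parameter.

```python
from collections import Counter

TAG_LENGTH = 64  # tag length used throughout the slides


def count_tags(inserts, min_count=1, tag_length=TAG_LENGTH):
    """Collapse barcode-stripped reads into unique tags with counts."""
    counts = Counter()
    for insert in inserts:
        tag = insert[:tag_length]
        # Simplification: drop short or ambiguous reads outright.
        if len(tag) < tag_length or "N" in tag:
            continue
        counts[tag] += 1
    # Keep only tags seen at least min_count times.
    return {tag: n for tag, n in counts.items() if n >= min_count}


tag = "CAGC" + "A" * 60
reads = [tag + "GGG", tag + "TTTTT", "CAGC"]
print(count_tags(reads))  # the two long reads collapse into one tag
```

Raising `min_count` to 2 is one way to ignore singleton reads, which are the most likely to be sequencing errors.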

7 TagCounts File
Header fields: Number of Tags; Max Size of Tag x 32bp
Columns: Tag Sequence, Length (bp), Count
CAGCAAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGAC 64 1
CAGCAAAAAAAAAAAAAAAAAAAACCAAGAATTTTATGTTTCCTACCTCCAACCCCAGGACTTT 64 1
CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCCTATACCTCATCCCACAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTTATTTCTCATACCTCATACCACAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTGATGTCTCAAACCCCAACACACAGGCTT 64 1
CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTGTCTCAAACCCCAACCCCCAGGCCT 64 1
CAGCAAAAAAAAAAAAAAAAAAAAGGGGTTTTGAATAAAAAAAACTGAAGGATCTTAAATCTAC 64 1
CAGCAAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTTTCATACCTCATACCACAGGACT 64 1
CAGCAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACT 64 2
CAGCAAAAAAAAAAAAAAAAAAACCAAAAAATTTTATGTCTCAAACCCCAAACCCCAGGGCTTC 64 1
CAGCAAAAAAAAAAAAAAAAAAACCAAATAATTTGATGTCTCATACCTCATACCACAGGGCTTC 64 1
CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTTC 64 1
CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTTGGCACTCAAGCCCAAAACCACAGATCTTC 64 1
CAGCAAAAAAAAAAAAAAAAAAACCAAGTAATTTGTTGTCTCATACCTCATACCACAGAACTTC 64 1
CAGCAAAAAAAAAAAAAAAAAAACCCAAAAAATTTTTTTTTCCAACCCCAAAACCCAAGGCTTC 64 1
CAGCAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTTTCCCAAACCCCAAACCCCAGGCTTT 64 1
CAGCAAAAAAAAAAAAAAAAAAAGGGATAGGGAAGATGGGGGAGAGTGGCGGCCACGCATGGAA 64 1
CAGCAAAAAAAAAAAAAAAAAACAACAAGGAATTTGGGTATTCATTCCCCATACCCCAGGATTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACAAAAAAATTTGTTTTCTCAACCCCAAAACCAAAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT 64 2
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACCAAGGAATTGAATCTCTCACACCTTAAAACACCGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAACACCAATTATTTGAAAGATCATTACCCTATACCACGGGGTTC 64 1
CAGCAAAAAAAAAAAAAAAAAACCAAAAAATTTGATGTCTCATACCCCATACCACAGGACTCCC 64 1
CAGCAAAAAAAAAAAAAAAAAACCAAAAAATTTTATTTCTCATACCCCAAACCCCAGGACTTCC 64 1
CAGCAAAAAAAAAAAAAAAAAACCAAAGAATTTTATGTCTCATACCTCAAACCAAAGGACTTCC 64 1
CAGCAAAAAAAAAAAAAAAAAACCAAATAAATTTGTTGCTCATACCCCAAACCACAGGGCTTTC 64 1
CAGCAAAAAAAAAAAAAAAAAACCAAGCAATTTGATTCCACTTAATCTATCCCACAGAACTTCC 64 1
CAGCAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCC 64 1
CAGCAAAAAAAAAAAAAAAAAACCCAAAAAATTTTTTGTTTCCCTAACCCCAAAACCACGGACT 64 1
CAGCAAAAAAAAAAAAAAAAAACCCAATGAATTTGTAGTGCCAAACCCCAAACCAACGGACTTT 64 1
CAGCAAAAAAAAAAAAAAAAAACCCCAAGAAATTTGATGTCTCATACCCCAAACCCCAGGACTT 64 1
CAGCAAAAAAAAAAAAAAAAAAGACCAGGTAATTATTGCTCACATACATCAAACTCCAATTGCC 64 1
CAGCAAAAAAAAAAAAAAAAAAGCGCCTAACGTTTCAAAATGAATGAGTTGCCAACCAAGGACT 64 1
CAGCAAAAAAAAAAAAAAAAAAGGGTTAGGAAAGATGGGTGGGAGGGGCGGGCCTGCTTGAAAT 64 1

8 Unique Reads
CAGCAAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGAC
+
CAGCAAAAAAAAAAAAAAAAAAAACCAAGAATTTTATGTTTCCTACCTCCAACCCCAGGACTTT
+
CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCCTATACCTCATCCCACAGGACTT
+
CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTTATTTCTCATACCTCATACCACAGGACTT
+
CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTGATGTCTCAAACCCCAACACACAGGCTT
+
CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTGTCTCAAACCCCAACCCCCAGGCCT
+
CAGCAAAAAAAAAAAAAAAAAAAAGGGGTTTTGAATAAAAAAAACTGAAGGATCTTAAATCTAC
+
CAGCAAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTTTCATACCTCATACCACAGGACT
+
CAGCAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACT
+
CAGCAAAAAAAAAAAAAAAAAAACCAAAAAATTTTATGTCTCAAACCCCAAACCCCAGGGCTTC
+
CAGCAAAAAAAAAAAAAAAAAAACCAAATAATTTGATGTCTCATACCTCATACCACAGGGCTTC
+
CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTTC

BWA (Burrows-Wheeler Aligner)
Aligns the tags, in FASTQ format, to the reference genome.
Parameters:
  Similarity of read sequence and genome sequence. This controls the tradeoff between the number of SNPs and confidence in the alignment. Default is 4 edits per sequence.
  Gap penalty. This controls sensitivity to indels. Default is no indels within 5 bp of the read ends.
Outputs a SAM alignment.
There are many other aligners. BWA is fast and memory efficient, but may not be appropriate for your species.
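The TagCountsToFASTQ step that feeds the aligner can be sketched like this. The read-name layout `length=...count=...` follows the names visible in the SAM slide; the quality character is a placeholder assumption, since unique tags carry no per-base qualities of their own.

```python
def tags_to_fastq(tag_counts, quality_char="f"):
    """Yield one FASTQ record (as text) per (tag, count) pair."""
    for tag, count in tag_counts:
        # Name mirrors the 'length=64count=...' style seen in the SAM slide.
        name = f"length={len(tag)}count={count}"
        # Placeholder quality string: one assumed character per base.
        yield f"@{name}\n{tag}\n+\n{quality_char * len(tag)}\n"


records = tags_to_fastq([("CAGCAAAAAAAAAAAAAAAAAACACC", 2)])
print("".join(records), end="")
```

Carrying the count in the read name lets later steps recover how often each tag was seen without re-reading the TagCounts file.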

9 Generic Alignment (SAM)
length=64count= M2I7M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCT
length=64count= M2I8M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCT
length=64count= M2I9M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCT
length=64count= M2I8M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCT
length=64count= M2I7M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCT
length=64count= M3D47M2I11M * 0 0 CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGA
length=64count= M * 0 0 CCTTTCTTGGCCTGGTTCTCACTCATCTGGGCTT
length=64count= M * 0 0 GCCCGTCTACACGCTTGTGTCCCATGCCCGCAAGCCGCCCCA
length=64count= M * 0 0 GCCCGTCTACACGTTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count= M * 0 0 GCCCGTCTACAGGCTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count= M * 0 0 GCCCGTCTACCCGCTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count= M * 0 0 GCCCGTCTCCACGCTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count= M * 0 0 GCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count= M * 0 0 CCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count= M * 0 0 GCCCGTCTACACCCTTGTGTCCCATGCACGCAAGCCGCCCCA
length=64count= M1I5M * 0 0 CAGCAAAAAAAAAAAATAGAACTTAGAAACTTAT
length=64count= M * 0 0 CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGAT
length=64count= M1I14M * 0 0 TGCCCGTCTACACGCTTGTGTCCCAT
length=58count= M1I59M * 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGA
length=64count= M1I59M * 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGA
length=64count= M2I14M * 0 0 GCCCGTCTACACGCTTGTGTCCCATG
length=64count= M * 0 0 CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCC
length=64count= M1I14M * 0 0 CCCATTGTTGTATCTTGATTGCAGAC
length=64count= M1I14M * 0 0 ACCATTGTTGTATCTTGATTGCAGAC
length=64count= M1I14M * 0 0 CCCATTGTTGTATCTTGATTGCAGAC
length=64count= M1I14M * 0 0 ACCATTGTTGTATCTTGATTGCAGAC
length=64count= M1I59M * 0 0 CAGCAAAAAAAAAACATCCTCTCCTCATACGCTC
length=64count= M * 0 0 CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAA
length=64count= M * 0 0 CTGCCCGTCTACACGCTTGTGTCCCATGCACGCA
length=64count= M * 0 0 CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAG
length=64count= M * 0 0 CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAG
length=57count= M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAG
length=64count= M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAG
length=64count= M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAG
length=64count= M1I47M * 0 0 TCCATTGTTGTATCTTCGATTGCAGA

SAMConverter & TagsOnPhysicalMap (TOPM)
TOPM is the key file for interpreting the tags present in a species.
Contains:
  Tag sequence
  Position
  Divergence from reference
  Polymorphisms
  Genetic mapping support
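The CIGAR column of the SAM records is what tells a converter how a tag sits on the reference. A minimal sketch of reading one CIGAR string is shown below; the operator semantics follow the SAM specification, while the example string `4M3D47M2I11M` is hypothetical (the leading counts of the CIGAR strings on the slide did not survive transcription).

```python
import re

# One CIGAR element: a run length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def summarize_cigar(cigar):
    """Summarize reference span, read length and indel count of a CIGAR."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    return {
        # Operations that consume reference bases.
        "ref_span": sum(n for n, op in ops if op in "MDN=X"),
        # Operations that consume read bases.
        "read_length": sum(n for n, op in ops if op in "MIS=X"),
        # Number of insertion or deletion events.
        "indels": sum(1 for _, op in ops if op in "ID"),
    }


print(summarize_cigar("4M3D47M2I11M"))
```

With the reference span and the start position from the SAM record, the end position of a tag on the physical map follows directly, and the indel count gives a rough measure of divergence from the reference.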

10 TagsOnPhysicalMap File
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT
CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT
CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT
CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCCC
CAGCAAAAAAAAAAAACGGTTCTCAATTCCAAGCCCAGATGAGTGAGAACCAGGCCAAGAAAGG
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGGGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAACGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCCTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGGGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGGAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGG
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGGGTGTAGACGGGC
CAGCAAAAAAAAAAAATAGAACTTAGAAACTTATACCGTGGGACACGTCAAGTGACTGCTGATG
CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGATGGAAACAACAAACAAACAAAGA
CAGCAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCA
CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAA
CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGGGA
CAGCAAAAAAAAAAAGGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATTT
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGGGAGTCTGCAATCAAGATACAACAATGGG
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGTGAGTCTGCAATCAAGATACAACAATGGT
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGGGGGTGAGTCTGCAATCAAGATACAACAATGGG
CAGCAAAAAAAAAAATGCAGAACGAGTGATGAGGCAGAGTCTGCAATCAAGATACAACAATGGT
CAGCAAAAAAAAAACATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATT
CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAATGAAGGCAAAAGTGCCCTTCTTAGCTGTAG
CAGCAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCAG
CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC
CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC

BWA sensitivity is pretty poor
Alignment Class        BWA   Bowtie2
Single Best Mapping    57%   69%
Multiple Mapping       17%   17%
Unmapped               26%   14%
BLAST is about the same as Bowtie2.
Code needs to be updated to parse Bowtie2.
Many of the multiple-mapping tags do NOT map with 100% identity, which suggests they can be genetically mapped.

11 Tags by Taxa
chardonnay
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT
CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT
CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT
CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT
CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCCC
CAGCAAAAAAAAAAAACGGTTCTCAATTCCAAGCCCAGATGAGTGAGAACCAGGCCAAGAAAGG
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGGGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAACGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCCTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGGGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGGAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGG
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGGGTGTAGACGGGC
CAGCAAAAAAAAAAAATAGAACTTAGAAACTTATACCGTGGGACACGTCAAGTGACTGCTGATG
CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGATGGAAACAACAAACAAACAAAGA
CAGCAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCA
CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAA
CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGGGA
CAGCAAAAAAAAAAAGGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC
CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATTT
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGGGAGTCTGCAATCAAGATACAACAATGGG
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGTGAGTCTGCAATCAAGATACAACAATGGT
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGGGGGTGAGTCTGCAATCAAGATACAACAATGGG
CAGCAAAAAAAAAAATGCAGAACGAGTGATGAGGCAGAGTCTGCAATCAAGATACAACAATGGT
CAGCAAAAAAAAAACATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATT
CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAATGAAGGCAAAAGTGCCCTTCTTAGCTGTAG
CAGCAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCAG
CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC
CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC
CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAAA
CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGAGAT
CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGTTGCAGAGAA

12 TagsToSNPByAlignment
Tags that align to the same region are aligned against one another, and SNPs and small indels are identified. Based on these alignments, SNPs are propagated to the specific lines carrying each tag and written to a HapMap file.
Parameters:
  Chromosomes to search for SNPs
  Bi- or tri-allelic SNPs
  Indels
  Genetic mapping support
  Max markers on a chromosome
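The core idea of comparing co-located tags can be sketched as a column-by-column scan. This is a deliberately simplified illustration, assuming the tags start at the same reference position, lie on the same strand and contain no indels; the actual plugin performs a proper alignment first. The function name and toy tags are hypothetical.

```python
from collections import Counter


def find_variable_sites(tags):
    """Return {offset: Counter of bases} for columns with more than one base."""
    length = min(len(tag) for tag in tags)
    sites = {}
    for offset in range(length):
        # Tally the bases that all tags show at this offset.
        column = Counter(tag[offset] for tag in tags)
        if len(column) > 1:
            sites[offset] = column
    return sites


# Toy tags that differ at a single position.
print(find_variable_sites(["ACGTA", "ACCTA", "ACGTA"]))
```

Each variable column is a candidate SNP; the per-base counts are what a caller would then compare against the tags-by-taxa counts to assign genotypes to individual lines.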

13 HapMap Format
rs# alleles chrom pos strand SgSBRIL067:633Y5AAXX:2:C9 SgSBRI
S1_2100 A/G N N N N N
S1_2163 T/C N N N N N
S1_13837 T/G N N N N
S1_14606 C/T N N C N
S1_20601 T/A T N N N
S1_68332 C/T N N N N
S1_68596 A/T A N N N
S1_69309 G/A N G N N
S1_79955 T/G N T G T
S1_79961 T/G N T T T
S1_80584 G N N N N
S1_80647 C/T N N N N
S1_81274 T/G N N N N
S1_ G/A N N N N
S1_ T/G N N N N
S1_ C/T N N N N
S1_ T/C N N N N
S1_ G/A G G A N
S1_ T/G N N T N
S1_ A/G N A G N
S1_ C/T N N N N
S1_ T/C N T N N

Why another pipeline?
  The last maize build (21,000 taxa) with the discovery pipeline took over 2 weeks.
  Most common alleles have been identified after the first few discovery builds.
  Use the information from the discovery pipeline to call SNPs in new runs quickly.
  Improve efficiency and automate.
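Reading a HapMap file such as the one on this slide can be sketched as below. The sketch assumes the standard tab-delimited HapMap layout with 11 leading metadata columns before the taxa columns, and "N" as the missing call; the example header and row in the usage are constructed for illustration rather than copied from a real file.

```python
HAPMAP_FIXED_COLUMNS = 11  # rs#, alleles, chrom, pos, strand, ... QCcode


def site_call_rates(lines):
    """Return {site name: fraction of taxa with a non-missing call}."""
    header = lines[0].rstrip("\n").split("\t")
    n_taxa = len(header) - HAPMAP_FIXED_COLUMNS
    rates = {}
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        calls = fields[HAPMAP_FIXED_COLUMNS:]
        # "N" marks a missing genotype call.
        called = sum(1 for call in calls if call != "N")
        rates[fields[0]] = called / n_taxa
    return rates


header = "\t".join(["rs#", "alleles", "chrom", "pos", "strand"]
                   + ["meta"] * 6 + ["t1", "t2", "t3", "t4"])
row = "\t".join(["S1_79955", "T/G", "1", "79955", "+"]
                + ["NA"] * 6 + ["N", "T", "G", "T"])
print(site_call_rates([header, row]))
```

At low GBS coverage most calls are "N", so a per-site call rate like this is usually the first number inspected before any filtering.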

14-17 GBS bioinformatics pipeline
[Diagram, built up over four slides, with two tracks. Discovery: Tags by Taxa, Tag Counts, TOPM, SNP Caller, Genotypes, Filtered Genotypes. Production: TOPM (TagsOnPhysicalMap), Genotypes.]

18 Running the Production Pipeline
Required files:
  Sequence file (fastq or qseq)
  Key file
  Production TOPM
  TASSEL 3 Standalone & RawReadsToHapMapPlugin
Running the pipeline:
  One lane processed at a time
  HapMap files by chromosome
  ~7 minutes

Testing the Production Pipeline
Compared HapMap files produced by the Discovery Pipeline and the Production Pipeline.
Site comparison:
  Discovery 48,139
  Production 47,676
  Difference due to maximum of 8 alleles
99.98% correlation of genetic distance matrices

19 Shifting to HDF5
Hierarchical Data Format supports very large data sets and complex data structures.
Widely used in the climate and astronomy communities.
TBT files can approach 2 TB in size.
Compressed HDF5 can be 40 times smaller.
Access times look very good.
Working to fuse TOPM, TBT, and key file into one HDF5 repository.

Why can GBS be complicated? Tools for filtering, error correction and imputation.
Edward Buckler, USDA-ARS, Cornell University

20 Maize has more molecular diversity than humans and apes combined
[Figure: Silent Diversity, with values 1.34%, 0.09% and 1.42% (Zhao PNAS 2000; Tenaillon et al., PNAS 2001)]

Only 50% of the maize genome is shared between two varieties
[Figure: Plant 1, Plant 2, Plant 3 (Maize, 50%) compared with Person 1, Person 2, Person 3 (Humans, 99%)]
Fu & Dooner 2002, Morgante et al. 2005, Brunner et al. 2005
Numerous PAVs and CNVs - Springer, Lai, Schnable in

21 Maize genetic variation has been evolving for 5 million years
[Timeline figure, 5 mya to present, spanning the warm Pliocene and the cold Pleistocene, with ticks at 5, 4, 3, 2 and 1 mya. Maize events: Sister Genus Diverges, Modern Variation Begins Evolving, Zea species begin diverging, Maize domesticated. Human events: Divergence from Chimps, Ardipithecus, Australopithecus, Homo erectus, Modern Variation Begins, Modern Humans.]

What are our expectations with GBS?

22 High Diversity Ensures High Return on Sequencing
Proportion of informative markers
  Highly repetitive: 15%, not easily informative
  Half the genome is not shared between two maize lines
    Potentially all of these are informative with a large enough database
  Low copy shared proportion (1% diversity)
    Bi-parental information = 1 - (1 - 0.01)^64 ≈ 47% informative
    Association information = 1 - (1 - 0.05)^64 ≈ 96% informative

Expectation of marker distribution
[Pie charts]
  Biparental population: Presence/Absence 50%; Nonpolymorphic 18%; Biallelic 17%; Too Repetitive 15%
  Across the species: Presence/Absence 50%; Multiallelic 34%; Too Repetitive 15%; Nonpolymorphic 1%
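The informative-marker estimate on this slide is the probability that a 64-bp tag carries at least one polymorphism, given a per-base diversity. A short sketch of that arithmetic follows; it assumes sites are independent, and the function name is ours rather than anything from the pipeline.

```python
def fraction_informative(per_base_diversity, tag_length=64):
    """Probability that a tag contains at least one polymorphic base."""
    # Complement of "every one of the tag_length bases is identical".
    return 1.0 - (1.0 - per_base_diversity) ** tag_length


for label, diversity in [("bi-parental", 0.01), ("association", 0.05)]:
    print(f"{label}: {fraction_informative(diversity):.1%} of tags informative")
```

Applied to the roughly 35% of tags that are low copy and shared, these fractions are in line with the pie-chart shares of about 17% biallelic in a biparental population and about 34% multiallelic across the species.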

23 Sequencing Error
Illumina basic error rate is ~1%.
Error rates are associated with distance from the start of the sequence.
  Bad: GBS puts these all at the same position.
  Good: Reverse reads can correct.
  Good: Errors are consistent and modelable.

24 Reads with errors
Perfect sequences: (1 - 0.01)^64 ≈ 52.5% of the 64 bp sequences are perfect; 47.5% are NOT perfect.
The errors are autocorrelated, so the proportion of perfect sequences is a little higher, and the proportion with 2 or more errors is also higher.

Do we see these errors?
Assume 10,000 lines genotyped at 0.5X coverage.
Columns: Base, Type, Read # (no SNP), Read # (w/ SNP)
A  Major
C  Minor (50 real)
G  Error
T  Error
[Read counts are not preserved in this transcript.]
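The share of perfect reads follows from a binomial model with a 1% per-base error rate over 64 bases. The sketch below computes the probability of exactly k errors per read under the simplifying assumption that errors are independent, which the slide notes is not quite true because errors are autocorrelated.

```python
from math import comb


def error_count_probability(k, read_length=64, error_rate=0.01):
    """Binomial probability of exactly k erroneous bases in one read."""
    return (comb(read_length, k)
            * error_rate ** k
            * (1 - error_rate) ** (read_length - k))


p_perfect = error_count_probability(0)
p_one = error_count_probability(1)
print(f"perfect: {p_perfect:.3f}")
print(f"exactly one error: {p_one:.3f}")
print(f"two or more errors: {1 - p_perfect - p_one:.3f}")
```

Autocorrelation moves probability mass away from the single-error class toward both the perfect and the multi-error classes, which is the adjustment the slide describes.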

25 Do Errors Matter?
Yes: Imputation, haplotype reconstruction
Maybe: GWAS for low-frequency SNPs
No: GS, genetic distance, mapping on biparental populations

Expectations of Real SNPs
  Vast majority are biallelic
  Homozygosity is predicted by inbreeding coefficient
  Allele frequency is constrained in structured populations
  In linkage disequilibrium with neighboring SNPs

26 Clean Up and Imputation
[Flowchart. Processes: MergeDuplicateSNPsPlugin (merge reads from opposite sides); GBSHapMapFiltersPlugin (site coverage, taxa coverage, inbreeding coefficient, LD); BiParentalErrorCorrectionPlugin (error rate estimation, LD filters); MergeIdenticalTaxaPlugin (error rate estimation, LD filters); Imputation (INBREDS: PARTIALLY SOLVED; HETEROZYGOUS: NOT SOLVED YET); Imputation & Phasing. Files (data structures): HapMap. Downstream uses: GWAS, Kinship, Distance, Phylogeny, LD, GS.]

Filters in TagsToSNPByAlignmentMTPlugin
Only calls bi-allelic SNPs (hard coded now)
  Two most common alleles used
Inbreeding coefficient (-mnf)
  If you have inbred samples, definitely use it; very powerful against errors and paralogues
Minimum minor allele frequency (-mnmaf)
  Very important if you do not have other tools for filtering (bi-parental populations or LD)
  Set to >= 1% if no other filter method is present

27 MergeDuplicateSNPsPlugin
When restriction sites are less than 128 bp apart, we may read a SNP from both directions (strands)
  ~13% of all sites
  Fusing increases coverage
  Fixes errors
-mismat = set maximum mismatch rate
-callhets = mismatch set to hets or not

GBSHapMapFiltersPlugin
Basic filters for coverage of sites and taxa, inbreeding coefficient, and LD
-mntcov = minimum taxa coverage (e.g. 0.05)
-mnscov = minimum site coverage, proportion of taxa with call (e.g. 0.10)
-mnmaf = minimum minor allele frequency (e.g. 0.01)
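The site-coverage and minor-allele-frequency filters can be sketched for a single site as follows. This is an illustration of the two thresholds rather than the plugin's logic: the defaults reuse the example values given for -mnscov and -mnmaf, heterozygous (IUPAC) calls are ignored for simplicity, and the function name is hypothetical.

```python
from collections import Counter

HOMOZYGOUS_CALLS = {"A", "C", "G", "T"}


def site_passes(calls, min_site_coverage=0.10, min_maf=0.01):
    """Apply site coverage and minor allele frequency filters to one site."""
    called = [call for call in calls if call in HOMOZYGOUS_CALLS]
    # Site coverage: proportion of taxa with a call at this site.
    if not calls or len(called) / len(calls) < min_site_coverage:
        return False
    counts = Counter(called).most_common()
    if len(counts) < 2:
        return False  # monomorphic among called taxa, so MAF is 0
    # Minor allele = second most common allele among called taxa.
    minor_count = counts[1][1]
    return minor_count / len(called) >= min_maf


print(site_passes(list("AAAAAAAAAG")))    # well covered, MAF 0.10
print(site_passes(list("A" + "N" * 19)))  # only 5% of taxa called
```

Taxa coverage (-mntcov) is the same calculation transposed: the proportion of sites with a call for each taxon.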

28 GBSHapMapFiltersPlugin
-mnf = minimum inbreeding coefficient (e.g. 0.9)
  Don't use with outcrossers
-hld = require that sites are in high local LD
  Parameters are currently hard coded, so they are difficult to tune without editing the code.
  Tests a sliding window of 100 surrounding sites, and looks for a Bonferroni-corrected P < 0.01
  Useful, but can be a slow option. More work needed here.

Biparental populations
Limited range of alleles, expected allele frequencies, high LD
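The inbreeding coefficient used by the -mnf filter can be illustrated with the textbook definition F = 1 - H_obs / H_exp, where H_exp = 2pq. Whether the plugin computes it exactly this way is an assumption; the sketch only shows why inbred lines with many heterozygous calls at a site look suspicious.

```python
def inbreeding_coefficient(n_major_hom, n_het, n_minor_hom):
    """F = 1 - observed / expected heterozygosity at one site."""
    n = n_major_hom + n_het + n_minor_hom
    if n == 0:
        return None
    # Major allele frequency from genotype counts.
    p = (2 * n_major_hom + n_het) / (2 * n)
    expected_het = 2 * p * (1 - p)
    if expected_het == 0:
        return None  # monomorphic site: F is undefined
    return 1 - (n_het / n) / expected_het


print(inbreeding_coefficient(45, 10, 45))  # mostly homozygous calls
print(inbreeding_coefficient(25, 50, 25))  # Hardy-Weinberg proportions
```

A site with F well below the threshold in an inbred panel is more likely a paralogue or an error than a true SNP, which is why the slides recommend this filter for inbreds and warn against it for outcrossers.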

29 Maize RIL population expectations
Allele frequency 0% or 50%
Nearby sites should be in very high LD (r^2 > 50%)
Most sites can be tested if multiple populations are available

Bi-parental populations allow identification of error and non-Mendelian segregation
[Figure: sites classified as Non-segregating, Error, Segregating]
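The expectation that a site in a RIL population has a minor allele frequency of either about 0% or about 50% suggests a simple three-way classification. The sketch below is only an illustration of that idea: the thresholds `error_ceiling` and `min_segregating_freq` are hypothetical values chosen for the example, and the real plugin uses a binomial test rather than fixed cut-offs.

```python
def classify_ril_site(minor_count, total,
                      error_ceiling=0.01, min_segregating_freq=0.25):
    """Classify one site within one RIL population by minor allele frequency."""
    if total == 0:
        return "untested"
    freq = minor_count / total
    if freq <= error_ceiling:
        return "non-segregating"  # close to the expected 0%
    if freq >= min_segregating_freq:
        return "segregating"      # close to the expected 50%
    return "error"                # in between: matches neither expectation


for minor, total in [(0, 100), (48, 100), (10, 100)]:
    print(minor, total, classify_ril_site(minor, total))
```

Minor alleles seen at low but non-zero frequency in a population where the parents do not differ are the observations from which a per-site error rate can be estimated.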

30 Bi-parental populations allow identification of error and non-Mendelian segregation
[Figure: distribution of per-site error, with the median marked]
Median error rate is 0.004, but there is a long tail of some high-error sites

31 BiParentalErrorCorrectionPlugin
-popm = REGEX population identification (e.g. Z[0-9]{3})
-popf = population file (not implemented), instead of the -popm option
-mxe = maximum error rate (e.g. 0.01); calculated from non-segregating populations
-mnd = distortion from expectation (e.g. 2.0); the test uses both the binomial distribution and this distortion to classify segregation
-mnpld = minimum linkage disequilibrium, r^2 = 0.5; this is calculated within each population, and then the median across segregating populations is used
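The -popm option identifies populations by matching a regular expression against taxon names. A small sketch of that grouping is shown below, using the example pattern from the slide; the taxon names in the usage are hypothetical, and taxa that do not match the pattern are simply left out in this version.

```python
import re
from collections import defaultdict

POPULATION_PATTERN = re.compile(r"Z[0-9]{3}")  # example pattern from -popm


def group_by_population(taxa_names, pattern=POPULATION_PATTERN):
    """Group taxon names by the population code matched in each name."""
    groups = defaultdict(list)
    for name in taxa_names:
        match = pattern.search(name)
        if match:
            groups[match.group()].append(name)
    return dict(groups)


# Hypothetical taxon names for illustration.
taxa = ["Z001E0001", "Z001E0002", "Z002E0001", "B73"]
print(group_by_population(taxa))
```

Once taxa are grouped, per-population statistics such as allele frequency and LD can be computed separately and then combined across populations, as the -mxe and -mnpld descriptions indicate.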

32 MergeIdenticalTaxaPlugin
Fuse taxa with the same name. Useful for checks and duplicated runs. Also useful in determining error rates.
-xhets = exclude heterozygote calls (e.g. true)
-hetfreq = frequency between hets and homozygous calls (e.g. 0.76)

Product of Filtering
After filters, in maize we find error rates:
  AA<>aa = <
  AA<>Aa = 0.8 at low coverage
SNPs in wrong location <~1%. Lower in other species.
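Fusing replicate runs of the same taxon comes down to reconciling several calls at each site. The sketch below shows one possible rule and is not the plugin's documented behavior: missing calls are ignored, heterozygous calls are optionally excluded, and the most common base is kept only if it reaches an agreement threshold. The 0.76 default borrows the -hetfreq example value, whose exact role in the plugin is not spelled out on the slide, and returning "N" on disagreement is an assumption.

```python
from collections import Counter

HOMOZYGOUS_CALLS = {"A", "C", "G", "T"}


def merge_calls(calls, exclude_hets=True, min_agreement=0.76):
    """Merge one site's calls from replicate runs of the same taxon."""
    informative = [call for call in calls if call != "N"]
    if exclude_hets:
        informative = [c for c in informative if c in HOMOZYGOUS_CALLS]
    if not informative:
        return "N"
    base, count = Counter(informative).most_common(1)[0]
    # Keep the majority base only when replicates agree strongly enough.
    return base if count / len(informative) >= min_agreement else "N"


print(merge_calls(["A", "A", "N", "A"]))
print(merge_calls(["A", "G"]))
```

Counting how often replicates disagree, rather than discarding the disagreement, is what makes duplicated runs useful for estimating error rates.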

33 Clean Up and Imputation
[Flowchart as on slide 26, except that HETEROZYGOUS imputation is now marked Partially SOLVED.]

Missing Data
Two major sources:
  Sampling
    Low coverage, often used in big genomes with inbred lines
    Differential coverage caused by fragment size biases
  Biological
    Region of the genome not shared between lines
    Cut site polymorphisms
We want to impute what is missing from sampling, but not what is biologically missing.

34 Standard Imputation
Lots of algorithms: FastPhase, NPUTE, BEAGLE, etc.
These are appropriate for high-coverage loci, inbreds, and regions where biologically missing data is rare.
Some can be slow for the sample sizes that we have.

FastImputationBitFixedWindow
Imputation approach focused on speed and large sets of taxa with some closely related individuals.
Nearest neighbor approach, fixed window sizes
Strengths: Very accurate (<1% error), much faster than other algorithms (100X)
Weaknesses: Not good at recombination junctions or with heterozygosity
Code is in TASSEL, not as a plugin, but available
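The nearest-neighbor, fixed-window idea can be sketched for a single window as follows. This is a toy illustration, not the TASSEL implementation (which works on bit-encoded genotypes): each taxon's calls in the window are a string with "N" for missing, the donor with the highest identity over jointly called sites is chosen, and its calls fill the target's gaps.

```python
def impute_window(target, donors, missing="N"):
    """Fill missing calls in `target` from the most similar donor."""
    best_donor, best_score = None, -1.0
    for donor in donors:
        # Compare only sites where both taxa have a call.
        shared = [(t, d) for t, d in zip(target, donor)
                  if t != missing and d != missing]
        if not shared:
            continue
        score = sum(t == d for t, d in shared) / len(shared)
        if score > best_score:
            best_donor, best_score = donor, score
    if best_donor is None:
        return target  # nothing comparable: leave the window as is
    return "".join(d if t == missing else t
                   for t, d in zip(target, best_donor))


print(impute_window("ANGT", ["ACGT", "TTTT"]))
```

Because a single donor is used for the whole window, a recombination breakpoint inside the window or a heterozygous target has no good donor, which matches the weaknesses listed on the slide.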

35 Hidden Markov Model TASSEL GBS Imputation
Developed by Peter Bradbury
Aimed at GBS and biparental populations
Hidden Markov Model
Very accurate at determining boundaries
Works well on maize NAM inbred lines, and probably others.
AA <> BB error rate
AB > AA
Most problems appear in faulty populations
Available as a TASSEL 4.0 plugin

36 Mapping all the alleles (TagCallerAgainstAnchor)
Most maize alleles have no position on the reference map
Map allele presence (TagsByTaxa) versus an anchor SNP map (HapMap)
8.7M alleles were mapped in <24 hours using a 100-CPU cluster

Physical and genetic mapping of 8.7 million GBS alleles
[Chart, Alleles. Categories: Genetic and Physical Agree; Genetic and Physical Disagree; Not in Physical, Genetically mapped; Complex mapping or modest power currently; Consistent Error or Evenly repetitive]
Only 29% of alleles are simple - physical and genetic agree
55% of alleles are easily genetically mappable
[Chart, Reads. Categories: Reads with strong genetic and/or BLAST position; Reads with weaker position hypothesis; Reads with no hypothesis (error or even repetitive)]
Many complex alleles are rarer, so 71% of alleles are genetic and/or physically interpretable. With more samples and better error models perhaps 90% will be useable

37 Using the Presence/Absence Variants
In species like maize, this is the majority of the data.
Less subject to sequencing error.
Need imputation methods to differentiate between missing from sampling and biologically missing.

Future
Need better integration of Whole Genome Sequence data with the pipeline.
Add information on premature cut sites or mutated cut sites.
Use paired-end read information.
Full incorporation of presence/absence variants.
Increase the range of imputation tools and phasing for structured populations.
Quantitative genotype tools for polyploids/GS.
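One way to think about "missing from sampling" versus "biologically missing" is a small Bayes calculation. The sketch below is an assumption-laden illustration, not part of the pipeline: it assumes read counts for a present tag follow a Poisson distribution with a known mean depth, and the function name and the prior are hypothetical.

```python
import math


def prob_biologically_absent(mean_depth, prior_absent=0.5):
    """P(tag is truly absent | zero reads observed) under a Poisson depth model.

    mean_depth: expected read count for this tag in this taxon if it is present.
    prior_absent: assumed prior probability that the tag is biologically absent.
    """
    p_zero_if_present = math.exp(-mean_depth)  # Poisson P(0 reads | present)
    # An absent tag always yields zero reads, so its likelihood is 1.
    return prior_absent / (prior_absent + (1 - prior_absent) * p_zero_if_present)
```

At low depth a zero count says little about biology, while at higher depth the same zero count increasingly supports a true absence.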


More information

Meiosis and Mendel. Chapter 6

Meiosis and Mendel. Chapter 6 Meiosis and Mendel Chapter 6 6.1 CHROMOSOMES AND MEIOSIS Key Concept Gametes have half the number of chromosomes that body cells have. Body Cells vs. Gametes You have body cells and gametes body cells

More information

Analysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing

Analysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing Analysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing So Yeun Kwon, Hwan Young Lee, and Kyoung-Jin Shin Department of Forensic Medicine, Yonsei University College of Medicine, Seoul,

More information

Potato Genome Analysis

Potato Genome Analysis Potato Genome Analysis Xin Liu Deputy director BGI research 2016.1.21 WCRTC 2016 @ Nanning Reference genome construction???????????????????????????????????????? Sequencing HELL RIEND WELCOME BGI ZHEN LLOFRI

More information

Life Cycles, Meiosis and Genetic Variability24/02/2015 2:26 PM

Life Cycles, Meiosis and Genetic Variability24/02/2015 2:26 PM Life Cycles, Meiosis and Genetic Variability iclicker: 1. A chromosome just before mitosis contains two double stranded DNA molecules. 2. This replicated chromosome contains DNA from only one of your parents

More information

Breeding Values and Inbreeding. Breeding Values and Inbreeding

Breeding Values and Inbreeding. Breeding Values and Inbreeding Breeding Values and Inbreeding Genotypic Values For the bi-allelic single locus case, we previously defined the mean genotypic (or equivalently the mean phenotypic values) to be a if genotype is A 2 A

More information

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression) Using phylogenetics to estimate species divergence times... More accurately... Basics and basic issues for Bayesian inference of divergence times (plus some digression) "A comparison of the structures

More information

EM algorithm. Rather than jumping into the details of the particular EM algorithm, we ll look at a simpler example to get the idea of how it works

EM algorithm. Rather than jumping into the details of the particular EM algorithm, we ll look at a simpler example to get the idea of how it works EM algorithm The example in the book for doing the EM algorithm is rather difficult, and was not available in software at the time that the authors wrote the book, but they implemented a SAS macro to implement

More information