GBS Bioinformatics Pipeline
- Tyler May
- 5 years ago
Transcription
1 GBS Bioinformatics Pipeline ... or, Where Your Data Go After Sequencing
James Harriman, Ed Buckler, Jeff Glaubitz
Reference Genome Pipeline [flowchart of processes and the files (data structures) they produce]:
- Inputs: Qseq and key files
- QseqToTagCount: TagCounts per lane
- Merge TagsCounts: TagCounts for species (Master Tags)
- TagCountsToFASTQ, then BWA (Burrows-Wheeler Aligner): SAM alignment
- SAM convertor: TagsOnPhysicalMap
- QseqToTBT: TagsByTaxa files (1 per lane)
- Merge TagsByTaxa: TagsByTaxa for species
- TagsToSNPByAlignment: HapMap
2 Non- Reference Genome Pipeline QseqToTagCount TagCounts per lane Merge TagsCounts Qseq Key files QseqToTBT TagsByTaxa files (1 per lane) Merge TagsByTaxa TagCounts for species (Master Tags) TagsByTaxa for species TagHomology PhaseNoAnchor HapMap Process File (data structure) Raw Sequence (Qseq) HWI-ST GTCGATTCTGCTGACTTCATGGCTTCTGTTGACGACGATGTGGAACGAGCTGTTGTTGAAACTGATGAGGTTGC HWI-ST GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGTATGCGATGACAGTTACTCTTACTGTCCATTGTCAGCATTGC HWI-ST ATGTACTGCACCGTTGCAAGCGAGCACCACCAAGCGGCGGTATGCACTTTGCAATATGTAGCTAGAATAGGATT HWI-ST CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAATGCCTCTCTTGGCCTAGCATTTTGGGCATACCCTGTGACCA HWI-ST GATTTTACTGCACATCGGTCTTGTCACACCAGCTATACCTGTAGAGTTGCCTTCCACAGTTGTAGAGATCGGAAG HWI-ST TCACCCAGCATCACGCCCCTTCACATCCAGTAAAACCCCTGAATGATGTGCTGTCACTGTTTGATATACAGTTGT HWI-ST CTTGACTGCCACCATGAATATGTGTTCCAAGTGCCACAAGGACTTGGCCCTGAAGCAAGAACAAGCCAAACTTG HWI-ST CCACAACTGCTCCATCTTTTCCATGAGACATTGCTCCCGCCATTGCACCCTTGGCATCAGCAGAGATCGGAAGA HWI-ST GTATTCTGCACACGAATCAGCTGAGACACCAATTGGGCATGAATCAAATGGCGCCATTGCCGGGGATCGAACCC HWI-ST AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAGGGCTCATATTCAGTCACCTATATCAATTTCGAAATGGATTTC HWI-ST CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTTGGAGCGTCTATCGGCGTTGCTGAGATCGGAAGAGCGGTT HWI-ST TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTTAGTGGTTCGCAGAGCATTTGGCAGCTGAGATGGGAAGAGC HWI-ST GGACCTACTGCCCAAGAACGGCTCACCCATCATCCGCTTTCTTCACCTTCCGTCTTCTTTGGCTGAGATCGGAA HWI-ST GAGAATCAGCGTGTACGGGGCACGGGGTGACTGCTGTTGCGTGCGAGGGCTGAGATCGGAAGAGCGGTTCA HWI-ST TTCTCCAGCCGCATGGGCCGGAGACCAGAGAGGCCTCCCCAGGATTTGCACGATAGACCACGACTTATGGACG HWI-ST GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCAATTGCCTCAGCAACTTGGGCCACAAACACCACAGCTGAGA HWI-ST TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAAAAGAGGGCCCCTCACTTCTCTCAAGTGCTGAGATCGGAAG HWI-ST GGACCTACTGCCGGCGGGACGAAAGCGGTTGTTGAATGATGGGGGTCACTAGGCCTTCCAGGGCCTTTAAGC HWI-ST CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTCTCGGCCTTCTTCAAGCCATTCTCTTGGCAGACGGCTTTGC HWI-ST CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTGGTGCCCCTACCTCGGACAAGACAGATGCAGAGATCGGAA HWI-ST 
CCAGCTCAGCATGGATCTCTCCTTGATGGACTGAAAGCGCGTGTGCTCCCCTGTGTGATGGAAAGTGGCAGTG HWI-ST CCAGCTCAGCTCAAGCATTGGCTTCCGCTTTGGCATCCTGGAGGGTAAGCTTCTGCTCTTCTCACTAGAGGAG HWI-ST ACAAACAGCAGAGGTCGCATTGTAGTTAGTCCGGGACTTGCCCAGTTCATTGCTGAGATCGGAAGAGCGGTTC HWI-ST GCTCTACAGCTTCTGGCCAGAATGCTTTTGGCACTTGTTTGTCACAAAGCATGCACTGAACCATATTCATGATAG HWI-ST TTCTCCAGCTGCTACATGCACCGTGGGAAGAAGGTCTGCCCCACATACCCACCAGCCATCGCCCTTCTCACAT HWI-ST GAGATACAGCTGCGAATTGGGGGTTCCTGTGTTGCGAAGTGGCACTCGTGTGCCAAACTTGGCTACGCAGAGA HWI-ST AAAAGTTCAGCAATACCTGTTGAAGCCAAGCCCTTGTGGTGATTGCCTCGTTCATTGCTGCTGAGATCGGAAGA HWI-ST GAATCTGCTACTAGTGAGCCTTTGTATGGGGACCGAGTTCAGAAGCTCTAACCCTCGTTTTCCCATCTGCTGAG HWI-ST TAGCATGCCTGCTGCAGGAGTTGGTGCCCAGCATTCTCAGGTGTAGTCCAAATTCTGTCTGATACTTATTGTTTA HWI-ST TTCAGACAGATGATGCTTGTCAAGGGTCACCATCTTGCATTGCGCTGCGTCACATCCTTAGTGGGAATAGGGGA HWI-ST CTTGCTTCAGCCATGTAGAGTGGTGTTGCTCCTTTACTACCACGAATCATTGGTAACTCCCTGTTCTTATTCACC HWI-ST TTCAGACAGCCAAACGACGTCTTAGTGGAGAAAATACCTGAGAAAAGTCAAGAAACCAAAACACTAAAAAATGA HWI-ST AGCCTCAGCTTGGTTGCTTGTGGTTGGGGGTGAGGGGGCGGGCGGGAACTTATGTTTGCGCCCCGAGGCGG HWI-ST CTTGACTGGGCGTGGTGCTGAGGCTACTGCGGAATTGAGGTGTTGTCATCCACCGGATTGGGTCGTAGGGCG HWI-ST TTCAGACAGCCAACTGAGATGACTCTCATTCTTGGTAGGAACCAATTTCTGAGAGCTTCGTAATGACATCAACTA HWI-ST GAGATACAGCAACAAATGATGTCATTCCTTGCAAAAGCTGTACAAAGCCCTGGTTTCTTAGCTCAGCTGGTACAG HWI-ST GTGTTTGGTCGTGAAAGTGGACCTCTTTCAGGTGCAGGTGCGAGTAGAAGGAGGTCCCAGAGACGTGCGGCT HWI-ST GAGAAACCGCAGAATGATAGCAAAAAGCGCGTTACAGGAGATATTAAGAAAAGGAGACTTGCAATGCAGGAGTA HWI-ST CGTCAACTGCATGAAGGAGGTTGTCTGGCCGTTGGAGGAGTGATTTTGGAAGGCTGAGATCGGAAGAAAGGT HW 2
3 Assignment to Samples Barcode sequences from the plate map are compared to barcode sequences in the reads, in order to associate reads with the samples from which they originate. Parameters: Users supply a plate map and staff members supply DNA barcodes. These are combined into a table of barcodes by sample. Plate Map Project Details Sample Details Organism Detail Project Name Source Lab Plate Name Well Sample Name Pedigree Population Stock Number Sample BREAD Buckler BREAD-Maize-A A01 PI inbred 04A0160A Wenyan Zhu BREAD Buckler BREAD-Maize-A B01 blank plantae BREAD Buckler BREAD-Maize-A C01 PI inbred 04A0191B Wenyan Zhu BREAD Buckler BREAD-Maize-A D01 PI inbred 04A0165A Wenyan Zhu BREAD Buckler BREAD-Maize-A E01 PI inbred 04A0193B Wenyan Zhu BREAD Buckler BREAD-Maize-A F01 CML91 inbred 04A0005BA Wenyan Zhu plantae BREAD Buckler BREAD-Maize-A G01 CML311 inbred 04A0301A Wenyan Zhu plantae BREAD Buckler BREAD-Maize-A H01 CML311 inbred 04A0200A Wenyan Zhu plantae BREAD Buckler BREAD-Maize-A A02 MR_ (PI x PI655998)S4 PI x PI A0281A 10 BREAD Buckler BREAD-Maize-A B02 MR_ (PI x PI655998)S4 PI x PI A0279B 10 BREAD Buckler BREAD-Maize-A C02 MR_ (PI x PI655998)S4 PI x PI A0164B 10 BREAD Buckler BREAD-Maize-A D02 MR_ (PI x PI655998)S4 PI x PI A0163A 10 BREAD Buckler BREAD-Maize-A E02 MR_ (PI x PI655998)S4 PI x PI A0315B 10 BREAD Buckler BREAD-Maize-A F02 MR_ (PI x PI655998)S4 PI x PI F146114A 10 BREAD Buckler BREAD-Maize-A G02 MR_ (PI x PI655998)S4 PI x PI A0289B 10 BREAD Buckler BREAD-Maize-A H02 MR_ (PI x PI655998)S4 PI x PI A0171A 10 BREAD Buckler BREAD-Maize-A A03 MR_ (PI x PI655998)S4 PI x PI A0170B 10 BREAD Buckler BREAD-Maize-A B03 MR_ (PI x PI655998)S4 PI x PI A0381B 10 BREAD Buckler BREAD-Maize-A C03 MR_ (PI x PI655998)S4 PI x PI A0258A 10 BREAD Buckler BREAD-Maize-A D03 MR_ (PI x PI655998)S4 PI x PI A0304B 10 Cacao Buckler BREAD-Maize-A E03 Tc1536 Catie F1 04A0216A Jemmy Takrama Cacao Buckler BREAD-Maize-A F03 Tc7959 Brazil F2 04A0255A Jemmy Takrama BREAD Buckler 
BREAD-Maize-A G03 PI inbred 04A0217A Wenyan Zhu BREAD Buckler BREAD-Maize-A H03 PI inbred 04A0167A Wenyan Zhu BREAD Buckler BREAD-Maize-A A04 PI inbred 04P160451A Wenyan Zhu BREAD Buckler BREAD-Maize-A B04 PI17548 inbred 04A0258B Wenyan Zhu plantae BREAD Buckler BREAD-Maize-A C04 PI inbred 04A0244A Wenyan Zhu BREAD Buckler BREAD-Maize-A D04 PI inbred 04A0298A Wenyan Zhu BREAD Buckler BREAD-Maize-A E04 PI inbred 04A0293B Wenyan Zhu BREAD Buckler BREAD-Maize-A F04 PI inbred 04A0296A Wenyan Zhu 3
4 Example DNA Barcode Key
Flowcell Lane barcode sample Plate# Row Column PlateName
434GFAAXX 2 CTCC M A 1 IBM1 1A01
434GFAAXX 2 TGCA M A 2 IBM1 1A02
434GFAAXX 2 ACTA M A 3 IBM1 1A03
434GFAAXX 2 GTCT M A 4 IBM1 1A04
434GFAAXX 2 GAAT M A 5 IBM1 1A05
434GFAAXX 2 GCGT M A 6 IBM1 1A06
434GFAAXX 2 TGGC M A 7 IBM1 1A07
434GFAAXX 2 CGAT M A 8 IBM1 1A08
434GFAAXX 2 CTTGA M A 9 IBM1 1A09
434GFAAXX 2 TCACC M A 10 IBM1 1A10
434GFAAXX 2 CTAGC M A 11 IBM1 1A11
434GFAAXX 2 ACAAA M A 12 IBM1 1A12
434GFAAXX 2 TTCTC M B 1 IBM1 1B01
434GFAAXX 2 AGCCC M B 2 IBM1 1B02
434GFAAXX 2 GTATT M B 3 IBM1 1B03
434GFAAXX 2 CTGTA M B 4 IBM1 1B04
434GFAAXX 2 AGCAT M B 5 IBM1 1B05
434GFAAXX 2 ACTAT M B 6 IBM1 1B06
434GFAAXX 2 GAGAAT M B 7 IBM1 1B07
434GFAAXX 2 CCAGCT M B 8 IBM1 1B08
434GFAAXX 2 TTCAGA M B 9 IBM1 1B09
434GFAAXX 2 TAGGAA unknown 1 B 10 IBM1 1B10
Notes on Names & Chromosomes
Chromosomes (or contigs) MUST be integers.
Sample names, some advice: NO spaces, NO colons (:), and try to avoid unusual characters.
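The naming advice above can be checked mechanically before a key file is handed over. The following is a minimal sketch and not part of the pipeline: the function names and the "safe" character set (letters, digits, underscore, hyphen, period) are assumptions made for illustration; the slides themselves only rule out spaces and colons explicitly.

```python
import re

# Assumed "safe" alphabet for sample names (hypothetical choice).
SAFE_NAME = re.compile(r"^[A-Za-z0-9_.\-]+$")


def check_sample_name(name):
    """Return a list of problems found in a sample name (empty if none)."""
    if not name:
        return ["is empty"]
    problems = []
    if " " in name:
        problems.append("contains a space")
    if ":" in name:
        problems.append("contains a colon")
    # Only report "unusual characters" when nothing more specific was found.
    if not problems and not SAFE_NAME.match(name):
        problems.append("contains unusual characters")
    return problems


def is_valid_chromosome(label):
    """Chromosome or contig labels must be plain integers, e.g. '10'."""
    return re.fullmatch(r"[0-9]+", label) is not None
```

A label such as "chr10" fails the chromosome check, so contig names would need to be mapped to integers before running the pipeline.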
5 QSeqToTagCounts
Processes a Qseq file so we know which alleles (tags) are present in the sample:
- Handles sequence quality issues
- Identifies the barcodes
- Removes problem tags
- Counts tags
6 GBS Restriction Fragment Structure
Fragment: barcode adapter, cut site, read, cut site, common adapter
Accepted read: barcode adapter, cut site, read
Rejected or trimmed reads:
- Potential chimeric sequence: barcode adapter, cut site, read, cut site, sequence
- Short sequence: cut site, read, cut site, common adapter
- Adapter dimer: barcode adapter, cut site, common adapter
Sequence Processing
Raw sequence data is processed into unique 64-bp sequences. For example:
CTCCCAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC
GTTGAACAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC
Becomes:
CAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC 64 2
(tag sequence, length in bp, count)
Parameters:
- Restriction enzyme: different enzymes will create different sequence motifs, such as overlapping cut sites, palindromes or wobble bases.
- Barcode: barcode sequences must be provided to identify acceptable reads.
- Number of identical sequences accepted: this gives investigators the option to ignore repetitive sequences or singleton reads.
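The sequence-processing step above (strip the barcode, check for the cut-site remnant, trim to 64 bp, count identical tags) can be sketched as follows. This is an illustration under stated assumptions, not the QseqToTagCount code: the function name and arguments are hypothetical, quality handling is omitted, and reads shorter than 64 bp after the barcode are simply dropped.

```python
from collections import Counter

TAG_LENGTH = 64  # tags are trimmed to 64 bp, as stated on the slide


def reads_to_tag_counts(reads, barcodes, cut_site_remnants, min_count=1):
    """Count identical 64-bp tags among barcoded reads.

    reads             -- iterable of raw read strings
    barcodes          -- barcode sequences from the key file
    cut_site_remnants -- sequences a good read must show right after the barcode
    min_count         -- drop tags seen fewer times than this (e.g. singletons)
    """
    counts = Counter()
    remnants = tuple(cut_site_remnants)
    # Try longer barcodes first so a short barcode that is a prefix of a
    # longer one does not claim the read.
    ordered = sorted(barcodes, key=len, reverse=True)
    for read in reads:
        for barcode in ordered:
            if not read.startswith(barcode):
                continue
            insert = read[len(barcode):]
            if insert.startswith(remnants) and len(insert) >= TAG_LENGTH:
                counts[insert[:TAG_LENGTH]] += 1
                break
    return {tag: n for tag, n in counts.items() if n >= min_count}
```

With the slide's two example reads (barcodes CTCC and GTTGAA, remnant CAGC taken from the example tags), both reads collapse to one tag with count 2. Raising `min_count` to 2 corresponds to ignoring singleton reads.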
7 TagCounts File Number of Tags Max Size of Tag x 32bp Tag Sequence Count Length (bp) CAGCAAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGAC 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGAATTTTATGTTTCCTACCTCCAACCCCAGGACTTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCCTATACCTCATCCCACAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTTATTTCTCATACCTCATACCACAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTGATGTCTCAAACCCCAACACACAGGCTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTGTCTCAAACCCCAACCCCCAGGCCT 64 1 CAGCAAAAAAAAAAAAAAAAAAAAGGGGTTTTGAATAAAAAAAACTGAAGGATCTTAAATCTAC 64 1 CAGCAAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTTTCATACCTCATACCACAGGACT 64 1 CAGCAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACT 64 2 CAGCAAAAAAAAAAAAAAAAAAACCAAAAAATTTTATGTCTCAAACCCCAAACCCCAGGGCTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCAAATAATTTGATGTCTCATACCTCATACCACAGGGCTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTTGGCACTCAAGCCCAAAACCACAGATCTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCAAGTAATTTGTTGTCTCATACCTCATACCACAGAACTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCCAAAAAATTTTTTTTTCCAACCCCAAAACCCAAGGCTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTTTCCCAAACCCCAAACCCCAGGCTTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAGGGATAGGGAAGATGGGGGAGAGTGGCGGCCACGCATGGAA 64 1 CAGCAAAAAAAAAAAAAAAAAACAACAAGGAATTTGGGTATTCATTCCCCATACCCCAGGATTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACAAAAAAATTTGTTTTCTCAACCCCAAAACCAAAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT 64 2 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGGAATTGAATCTCTCACACCTTAAAACACCGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAATTATTTGAAAGATCATTACCCTATACCACGGGGTTC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAAAAATTTGATGTCTCATACCCCATACCACAGGACTCCC 64 1 
CAGCAAAAAAAAAAAAAAAAAACCAAAAAATTTTATTTCTCATACCCCAAACCCCAGGACTTCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAAGAATTTTATGTCTCATACCTCAAACCAAAGGACTTCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAATAAATTTGTTGCTCATACCCCAAACCACAGGGCTTTC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAGCAATTTGATTCCACTTAATCTATCCCACAGAACTTCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCCAAAAAATTTTTTGTTTCCCTAACCCCAAAACCACGGACT 64 1 CAGCAAAAAAAAAAAAAAAAAACCCAATGAATTTGTAGTGCCAAACCCCAAACCAACGGACTTT 64 1 CAGCAAAAAAAAAAAAAAAAAACCCCAAGAAATTTGATGTCTCATACCCCAAACCCCAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAAGACCAGGTAATTATTGCTCACATACATCAAACTCCAATTGCC 64 1 CAGCAAAAAAAAAAAAAAAAAAGCGCCTAACGTTTCAAAATGAATGAGTTGCCAACCAAGGACT 64 1 CAGCAAAAAAAAAAAAAAAAAAGGGTTAGGAAAGATGGGTGGGAGGGGCGGGCCTGCTTGAAAT 64 1 Reference Genome Pipeline QseqToTagCount Qseq Key files QseqToTBT TagCounts per lane TagsByTaxa files (1 per lane) BWA (Burrows- Wheeler Aligner) SAM alignment TagCountsTo FASTQ Merge TagsCounts TagCounts for species (Master Tags) Merge TagsByTaxa TagsByTaxa for species SAM convertor TagsOnPhysical Map TagsToSNP ByAlignment HapMap Process File (data structure) 7
8 Unique Reads CAGCAAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGAC + CAGCAAAAAAAAAAAAAAAAAAAACCAAGAATTTTATGTTTCCTACCTCCAACCCCAGGACTTT + CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCCTATACCTCATCCCACAGGACTT + CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTTATTTCTCATACCTCATACCACAGGACTT + CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTGATGTCTCAAACCCCAACACACAGGCTT + CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTGTCTCAAACCCCAACCCCCAGGCCT + CAGCAAAAAAAAAAAAAAAAAAAAGGGGTTTTGAATAAAAAAAACTGAAGGATCTTAAATCTAC + CAGCAAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTTTCATACCTCATACCACAGGACT + CAGCAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACT + CAGCAAAAAAAAAAAAAAAAAAACCAAAAAATTTTATGTCTCAAACCCCAAACCCCAGGGCTTC + CAGCAAAAAAAAAAAAAAAAAAACCAAATAATTTGATGTCTCATACCTCATACCACAGGGCTTC + CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTTC BWA (Burrows-Wheeler Aligner) Aligns the tags in FASTA format to the reference genome Parameters: Similarity of read sequence and genome sequence. This controls the tradeoff between number of SNPs and confidence in the alignment. Default is 4 edits per sequence. Gap penalty. This controls sensitivity to indels. Default is no indels within 5bp of the read ends. Outputs a SAM Alignment There are many other aligners. BWA is fast and memory efficient, but may not be appropriate for your species 8
9 Generic Alignment (SAM) length=64count= M2I7M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCT length=64count= M2I8M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCT length=64count= M2I9M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCT length=64count= M2I8M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCT length=64count= M2I7M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCT length=64count= M3D47M2I11M * 0 0 CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGA length=64count= M * 0 0 CCTTTCTTGGCCTGGTTCTCACTCATCTGGGCTT length=64count= M * 0 0 GCCCGTCTACACGCTTGTGTCCCATGCCCGCAAGCCGCCCCA length=64count= M * 0 0 GCCCGTCTACACGTTTGTGTCCCATGCACGCAAGCCGCCCCA length=64count= M * 0 0 GCCCGTCTACAGGCTTGTGTCCCATGCACGCAAGCCGCCCCA length=64count= M * 0 0 GCCCGTCTACCCGCTTGTGTCCCATGCACGCAAGCCGCCCCA length=64count= M * 0 0 GCCCGTCTCCACGCTTGTGTCCCATGCACGCAAGCCGCCCCA length=64count= M * 0 0 GCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCA length=64count= M * 0 0 CCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCA length=64count= M * 0 0 GCCCGTCTACACCCTTGTGTCCCATGCACGCAAGCCGCCCCA length=64count= M1I5M * 0 0 CAGCAAAAAAAAAAAATAGAACTTAGAAACTTAT length=64count= M * 0 0 CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGAT length=64count= M1I14M * 0 0 TGCCCGTCTACACGCTTGTGTCCCAT length=58count= M1I59M * 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGA length=64count= M1I59M * 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGA length=64count= M2I14M * 0 0 GCCCGTCTACACGCTTGTGTCCCATG length=64count= M * 0 0 CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCC length=64count= M1I14M * 0 0 CCCATTGTTGTATCTTGATTGCAGAC length=64count= M1I14M * 0 0 ACCATTGTTGTATCTTGATTGCAGAC length=64count= M1I14M * 0 0 CCCATTGTTGTATCTTGATTGCAGAC length=64count= M1I14M * 0 0 ACCATTGTTGTATCTTGATTGCAGAC length=64count= M1I59M * 0 0 CAGCAAAAAAAAAACATCCTCTCCTCATACGCTC length=64count= M * 0 0 CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAA length=64count= M * 0 0 CTGCCCGTCTACACGCTTGTGTCCCATGCACGCA length=64count= M * 0 0 CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAG length=64count= M * 0 0 
CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAG length=57count= M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAG length=64count= M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAG length=64count= M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAG length=64count= M1I47M * 0 0 TCCATTGTTGTATCTTCGATTGCAGA SAMConverter & TagsOnPhysicalMap (TOPM) TOPM is the key file to interpret tags present in a species. Contains: Tag Sequence Position Divergence from reference Polymorphisms Genetic mapping support 9
10 TagsOnPhysicalMap File CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCCC CAGCAAAAAAAAAAAACGGTTCTCAATTCCAAGCCCAGATGAGTGAGAACCAGGCCAAGAAAGG CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGGGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAACGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCCTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGGGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGGAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGG CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGGGTGTAGACGGGC CAGCAAAAAAAAAAAATAGAACTTAGAAACTTATACCGTGGGACACGTCAAGTGACTGCTGATG CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGATGGAAACAACAAACAAACAAAGA CAGCAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCA CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAA CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGGGA CAGCAAAAAAAAAAAGGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATTT CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGGGAGTCTGCAATCAAGATACAACAATGGG CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGTGAGTCTGCAATCAAGATACAACAATGGT CAGCAAAAAAAAAAATGCAGAAAGAGTGATGGGGGTGAGTCTGCAATCAAGATACAACAATGGG CAGCAAAAAAAAAAATGCAGAACGAGTGATGAGGCAGAGTCTGCAATCAAGATACAACAATGGT CAGCAAAAAAAAAACATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATT CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAATGAAGGCAAAAGTGCCCTTCTTAGCTGTAG CAGCAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCAG CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 
CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC
BWA sensitivity is pretty poor
Alignment Class        BWA   Bowtie2
Single Best Mapping    57%   69%
Multiple Mapping       17%   17%
Unmapped               26%   14%
BLAST performs about the same as Bowtie2. The code needs to be updated to parse Bowtie2 output. Many of the multiple-mapping tags do NOT map with 100% identity, which suggests they can be genetically mapped.
11 Reference Genome Pipeline QseqToTagCount Qseq Key files QseqToTBT TagCounts per lane TagsByTaxa files (1 per lane) BWA (Burrows- Wheeler Aligner) SAM alignment TagCountsTo FASTQ Merge TagsCounts TagCounts for species (Master Tags) Merge TagsByTaxa TagsByTaxa for species SAM convertor TagsOnPhysical Map TagsToSNP ByAlignment HapMap Process File (data structure) Tags by Taxa chardonnay CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCCC CAGCAAAAAAAAAAAACGGTTCTCAATTCCAAGCCCAGATGAGTGAGAACCAGGCCAAGAAAGG CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGGGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAACGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCCTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGGGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGGAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGG CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGGGTGTAGACGGGC CAGCAAAAAAAAAAAATAGAACTTAGAAACTTATACCGTGGGACACGTCAAGTGACTGCTGATG CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGATGGAAACAACAAACAAACAAAGA CAGCAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCA CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAA CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGGGA CAGCAAAAAAAAAAAGGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATTT CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGGGAGTCTGCAATCAAGATACAACAATGGG CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGTGAGTCTGCAATCAAGATACAACAATGGT 
CAGCAAAAAAAAAAATGCAGAAAGAGTGATGGGGGTGAGTCTGCAATCAAGATACAACAATGGG CAGCAAAAAAAAAAATGCAGAACGAGTGATGAGGCAGAGTCTGCAATCAAGATACAACAATGGT CAGCAAAAAAAAAACATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATT CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAATGAAGGCAAAAGTGCCCTTCTTAGCTGTAG CAGCAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCAG CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAAA CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGAGAT CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGTTGCAGAGAA
12 TagsToSNPByAlignment
Tags that align to the same region are aligned against one another, and SNPs and small indels are identified. Based on these alignments, SNPs are propagated into a HapMap file for the specific lines that carry each tag.
Parameters:
- Chromosomes to search for SNPs
- Bi- or tri-allelic SNPs
- Indels
- Genetic mapping support
- Max markers on a chromosome
13 HapMap Format
rs# alleles chrom pos strand SgSBRIL067:633Y5AAXX:2:C9 SgSBRI
S1_2100 A/G N N N N N
S1_2163 T/C N N N N N
S1_13837 T/G N N N N
S1_14606 C/T N N C N
S1_20601 T/A T N N N
S1_68332 C/T N N N N
S1_68596 A/T A N N N
S1_69309 G/A N G N N
S1_79955 T/G N T G T
S1_79961 T/G N T T T
S1_80584 G N N N N
S1_80647 C/T N N N N
S1_81274 T/G N N N N
S1_ G/A N N N N
S1_ T/G N N N N
S1_ C/T N N N N
S1_ T/C N N N N
S1_ G/A G G A N
S1_ T/G N N T N
S1_ A/G N A G N
S1_ C/T N N N N
S1_ T/C N T N N
Why another pipeline?
- The last maize build (21000 taxa) with the discovery pipeline took over 2 weeks.
- Most common alleles have been identified after the first few discovery builds.
- Use the information from the discovery pipeline to call SNPs in new runs quickly.
- Improve efficiency and automate.
14 GBS bioinformatics pipeline [diagram] Discovery: Tags by Taxa, Tag Counts, TOPM; SNP Caller; Genotypes; Filtered Genotypes
15 GBS bioinformatics pipeline [diagram] Discovery and Production: Tags by Taxa, Tag Counts, TOPM (TagsOnPhysicalMap); SNP Caller; Genotypes
16 GBS bioinformatics pipeline [diagram] Discovery and Production: Tags by Taxa, Tag Counts, TOPM; SNP Caller; Genotypes; Filtered Genotypes
17 GBS bioinformatics pipeline [diagram] Discovery and Production: Tags by Taxa, Tag Counts, TOPM; SNP Caller; Genotypes; Filtered Genotypes; Genotypes
18 Running the Production Pipeline
Required files:
- Sequence file (fastq or qseq)
- Key file
- Production TOPM
- TASSEL 3 Standalone & RawReadsToHapMapPlugin
Running the pipeline:
- One lane processed at a time
- HapMap files by chromosome
- ~7 minutes
Testing the Production Pipeline
Compared HapMap files produced by the Discovery Pipeline and the Production Pipeline.
Site comparison: Discovery 48,139; Production 47,676. Difference due to maximum of 8 alleles.
99.98% correlation of genetic distance matrices.
19 Shifting to HDF5
- Hierarchical Data Format supports very large data sets and complex data structures
- Widely used in the climate and astronomy communities
- TBT files can approach 2 TB in size
- Compressed HDF5 can be 40 times smaller
- Access times look very good
- Working to fuse TOPM, TBT, and key file into one HDF5 repository
Why can GBS be complicated? Tools for filtering, error correction and imputation.
Edward Buckler, USDA-ARS, Cornell University
20 Maize has more molecular diversity than humans and apes combined
Silent diversity [figure values]: 1.34%, 0.09%, 1.42% (Zhao PNAS 2000; Tenaillon et al., PNAS 2001)
Only 50% of the maize genome is shared between two varieties [figure: Plants 1 to 3 (maize) share 50%; Persons 1 to 3 (humans) share 99%]
Fu & Dooner 2002; Morgante et al. 2005; Brunner et al. 2005
Numerous PAVs and CNVs - Springer, Lai, Schnable in
21 Maize genetic variation has been evolving for 5 million years
[timeline figure, 5 mya to present, warm Pliocene to cold Pleistocene. Maize events: sister genus diverges; modern variation begins evolving; Zea species begin diverging; maize domesticated. Human events: divergence from chimps; Ardipithecus; Australopithecus; Homo erectus; modern variation begins; modern humans]
What are our expectations with GBS?
22 High Diversity Ensures High Return on Sequencing
Proportion of informative markers:
- Highly repetitive: 15%, not easily informative
- Half the genome is not shared between two maize lines; potentially all of these are informative with a large enough database
- Low copy shared proportion (1% diversity):
  Bi-parental information = 1 - (1 - 0.01)^64, about 48% informative
  Association information = 1 - (1 - 0.05)^64, about 97% informative
Expectation of marker distribution:
Biparental population: Biallelic 17%; Presence/Absence 50%; Nonpolymorphic 18%; Too repetitive 15%
Across the species: Multiallelic 34%; Presence/Absence 50%; Nonpolymorphic 1%; Too repetitive 15%
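The "informative" percentages above follow from the chance that a 64-bp tag carries at least one difference when each base differs independently with the stated diversity. A minimal sketch of that arithmetic (the function name is hypothetical, and independence between bases is an assumption):

```python
TAG_LENGTH = 64  # tag length in bp, from the slides


def informative_fraction(per_base_diversity, tag_length=TAG_LENGTH):
    """Probability that a tag contains at least one polymorphic base."""
    return 1.0 - (1.0 - per_base_diversity) ** tag_length


# 35% of the genome is low copy and shared (100% - 15% repetitive - 50% not shared).
LOW_COPY_SHARED = 0.35

biparental = informative_fraction(0.01)   # 1% diversity between two parents
association = informative_fraction(0.05)  # 5% diversity across the species

print(f"biparental:  {biparental:.3f} of low-copy tags, "
      f"{LOW_COPY_SHARED * biparental:.2f} of all tags")
print(f"association: {association:.3f} of low-copy tags, "
      f"{LOW_COPY_SHARED * association:.2f} of all tags")
```

Evaluated this way, the fractions come out at roughly 0.47 and 0.96, slightly below the rounded 48% and 97% on the slide; multiplied by the 35% low-copy shared portion they reproduce the 17% biallelic and 34% multiallelic slices.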
23 Sequencing Error
Illumina basic error rate is ~1%. Error rates are associated with distance from the start of the sequence.
Bad: GBS puts these all at the same position
Good: reverse reads can correct
Good: errors are consistent and modelable
24 Reads with errors
Perfect sequences: 0.99^64 = 52.5% of the 64 bp sequences are perfect; 47.5% are NOT perfect.
The errors are autocorrelated, so the proportion of perfect sequences is a little higher, and the proportion with 2 or more errors is also higher.
Do we see these errors? Assume 10,000 lines genotyped at 0.5X coverage.
Base Type Read # (no SNP) Read # (w/ SNP)
A Major
C Minor (50 real)
G Error
T Error
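The 52.5% figure is the zero-error term of a binomial distribution with a 1% per-base error rate over 64 bases. A short sketch of that calculation, assuming independent errors (the slide notes that real errors are autocorrelated, so this is only a baseline; the function name is hypothetical):

```python
from math import comb


def prob_k_errors(k, read_length=64, error_rate=0.01):
    """Binomial probability of exactly k sequencing errors in one read."""
    return (comb(read_length, k)
            * error_rate ** k
            * (1.0 - error_rate) ** (read_length - k))


p_perfect = prob_k_errors(0)
p_one = prob_k_errors(1)
p_two_or_more = 1.0 - p_perfect - p_one

print(f"perfect: {p_perfect:.3f}, one error: {p_one:.3f}, "
      f"two or more: {p_two_or_more:.3f}")
```

Under independence about a third of reads carry exactly one error and roughly 13% carry two or more; autocorrelated errors shift weight from the one-error class toward both the perfect and the multi-error classes.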
25 Do Errors Matter?
Yes: imputation, haplotype reconstruction
Maybe: GWAS for low frequency SNPs
No: GS, genetic distance, mapping on biparental populations
Expectations of Real SNPs
- Vast majority are biallelic
- Homozygosity is predicted by inbreeding coefficient
- Allele frequency is constrained in structured populations
- In linkage disequilibrium with neighboring SNPs
26 Clean Up and Imputation
[workflow diagram of processes and files (data structures)]
- HapMap
- MergeDuplicateSNPsPlugin: merge reads from opposite sides
- GBSHapMapFiltersPlugin: site coverage, taxa coverage, inbreeding coefficient, LD
- BiParentalErrorCorrectionPlugin: error rate estimation, LD filters
- MergeIdenticalTaxaPlugin: error rate estimation, LD filters
- Imputation: INBREDS PARTIALLY SOLVED; HETEROZYGOUS NOT SOLVED YET
- HapMap, feeding GWAS, imputation & phasing, kinship, distance, phylogeny, LD, GS
Filters in TagsToSNPByAlignmentMTPlugin
- Only calls bi-allelic SNPs (hard coded now); the two most common alleles are used
- Inbreeding coefficient (-mnf): if you have inbred samples, definitely use it; very powerful against errors and paralogues
- Minimum minor allele frequency (-mnmaf): very important if you do not have other tools for filtering (bi-parental populations or LD); set to >= 1% if no other filter method is present
27 MergeDuplicateSNPsPlugin
When restriction sites are less than 128 bp apart, we may read a SNP from both directions (strands); this affects ~13% of all sites. Fusing increases coverage and fixes errors.
-mismat = set maximum mismatch rate
-callhets = mismatch set to hets or not
GBSHapMapFiltersPlugin
Basic filters for coverage of sites and taxa, inbreeding coefficient, and LD
-mntcov = minimum taxa coverage (e.g. 0.05)
-mnscov = minimum site coverage, proportion of taxa with call (e.g. 0.10)
-mnmaf = minimum minor allele frequency (e.g. 0.01)
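To make the site-level thresholds concrete, here is a minimal sketch of what a site coverage and minor allele frequency filter computes for one site. It is not the GBSHapMapFiltersPlugin code: the function names are hypothetical, calls are assumed to be single-letter HapMap genotypes with "N" for missing, and heterozygotes are assumed to use the standard IUPAC two-base codes.

```python
from collections import Counter

# Standard IUPAC codes for two-base heterozygotes.
IUPAC_HETS = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def site_coverage(calls):
    """Proportion of taxa with a non-missing call at this site."""
    return sum(c != "N" for c in calls) / len(calls)


def minor_allele_frequency(calls):
    """Frequency of the second most common allele among called alleles."""
    alleles = Counter()
    for call in calls:
        if call == "N":
            continue
        # A homozygous call contributes two copies of the same allele.
        for allele in IUPAC_HETS.get(call, call + call):
            alleles[allele] += 1
    total = sum(alleles.values())
    if total == 0 or len(alleles) < 2:
        return 0.0
    return alleles.most_common(2)[1][1] / total


def passes_site_filters(calls, min_site_cov=0.10, min_maf=0.01):
    """Apply thresholds analogous to -mnscov and -mnmaf to one site."""
    return (site_coverage(calls) >= min_site_cov
            and minor_allele_frequency(calls) >= min_maf)
```

The defaults mirror the example values on the slide (0.10 and 0.01); taxa coverage would be the same calculation applied across sites for one taxon.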
28 GBSHapMapFiltersPlugin
-mnf = minimum inbreeding coefficient (e.g. 0.9). Don't use with outcrossers.
-hld = require that sites are in high local LD. The parameters are currently hard coded, so they are difficult to tune without editing the code. Tests a sliding window of 100 surrounding sites and looks for a Bonferroni-corrected P < 0.01. Useful, but can be a slow option. More work needed here.
Biparental populations
Limited range of alleles, expected allele frequencies, high LD
29 Maize RIL population expectations
- Allele frequency 0% or 50%
- Nearby sites should be in very high LD (r^2 > 50%)
- Most sites can be tested if multiple populations are available
Bi-parental populations allow identification of error and non-Mendelian segregation [figure: sites classified as non-segregating, error, or segregating]
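The LD expectation above can be checked with the usual r^2 statistic between two biallelic sites. The sketch below assumes inbred lines coded 0 (parent A allele), 1 (parent B allele) or None (missing); the function name and the coding are illustrative assumptions, not the pipeline's data model.

```python
def r_squared(site_a, site_b):
    """r^2 between two biallelic sites scored 0/1 in inbred lines.

    Lines with a missing call (None) at either site are skipped.
    Returns NaN when either site is monomorphic or no lines overlap.
    """
    pairs = [(a, b) for a, b in zip(site_a, site_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return float("nan")
    p_a = sum(a for a, _ in pairs) / n       # frequency of allele 1 at site A
    p_b = sum(b for _, b in pairs) / n       # frequency of allele 1 at site B
    p_ab = sum(a * b for a, b in pairs) / n  # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    d = p_ab - p_a * p_b                     # disequilibrium coefficient D
    return d * d / denom
```

Two sites that always agree give r^2 = 1; in an eight-line example with a single discordant line the value drops to 0.6, still above the 50% threshold mentioned on the slide.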
30 Bi-parental populations allow identification of error and non-Mendelian segregation
Median error rate is 0.004, but there is a long tail of some high-error sites [figure: error rate distribution with the median marked]
31 BiParentalErrorCorrectionPlugin
-popm = REGEX population identification (e.g. Z[0-9]{3})
-popf = population file (not implemented), instead of the popm option
-mxe = maximum error rate (e.g. 0.01); calculated from non-segregating populations
-mnd = distortion from expectation (e.g. 2.0); the test uses both the binomial distribution and this distortion to classify segregation
-mnpld = minimum linkage disequilibrium, r^2 = 0.5; this is calculated within each population, and then the median across segregating populations is used
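The -popm regular expression groups taxa into populations by name. A minimal sketch of that idea, using the example pattern from the slide: the function name, the taxa names in the usage note, and the choice to search anywhere in the name are assumptions for illustration; how the plugin itself applies the pattern is not stated here.

```python
import re
from collections import defaultdict


def group_taxa_by_population(taxa_names, pattern=r"Z[0-9]{3}"):
    """Group taxa by the first substring of their name matching `pattern`.

    Taxa whose names do not match are left out of every population.
    """
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for name in taxa_names:
        match = regex.search(name)
        if match:
            groups[match.group(0)].append(name)
    return dict(groups)
```

For hypothetical names such as "Z001E0001", "Z001E0002" and "Z002E0001", this yields two populations, Z001 and Z002, and a name like "B73" falls outside both.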
32 MergeIdenticalTaxaPlugin
Fuse taxa with the same name. Useful for checks and duplicated runs. Also useful in determining error rates.
-xhets = exclude heterozygote calls (e.g. true)
-hetfreq = frequency between hets and homozygous calls (e.g. 0.76)
Product of Filtering
After filters, in maize we find error rate AA<>aa = < AA<>Aa = 0.8 at low coverage
SNPs in wrong location <~1%. Lower in other species.
33 Clean Up and Imputation [workflow diagram as on slide 26, with HETEROZYGOUS now marked partially solved]
Missing Data
Two major sources:
Sampling
- Low coverage, often used in big genomes with inbred lines
- Differential coverage caused by fragment size biases
Biological
- Region of the genome not shared between lines
- Cut site polymorphisms
We want to impute the data missing from sampling, but not the biologically missing data.
34 Standard Imputation
Lots of algorithms: FastPhase, NPUTE, BEAGLE, etc. These are appropriate for high coverage loci, inbreds, and regions where biologically missing data is rare. Some can be slow for the sample sizes that we have.
FastImputationBitFixedWindow
Imputation approach focused on speed and large sets of taxa with some closely related individuals. Nearest neighbor approach, fixed window sizes.
Strengths: very accurate (<1% error); about 100X faster than other algorithms
Weaknesses: not good at recombination junctions or with heterozygosity
Code is in TASSEL, not as a plugin, but available
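The nearest neighbor, fixed window idea can be illustrated with a toy implementation. This is a sketch of the general technique only and not the FastImputationBitFixedWindow code: the function names, the string-based genotype representation with "N" for missing, and the tie-breaking rule (first best donor wins) are all assumptions.

```python
MISSING = "N"


def window_distance(a, b):
    """Mismatch rate over sites called in both windows; None if no overlap."""
    shared = [(x, y) for x, y in zip(a, b) if x != MISSING and y != MISSING]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


def impute_fixed_window(genotypes, window=4):
    """Fill missing calls from the nearest neighbor within fixed windows.

    genotypes -- dict mapping taxon name to a string of calls (equal lengths)
    window    -- number of sites per window
    """
    if not genotypes:
        return {}
    n_sites = len(next(iter(genotypes.values())))
    result = {taxon: list(calls) for taxon, calls in genotypes.items()}
    for start in range(0, n_sites, window):
        end = min(start + window, n_sites)
        for taxon, calls in genotypes.items():
            target = calls[start:end]
            if MISSING not in target:
                continue
            # Find the taxon most similar to the target inside this window.
            best, best_d = None, None
            for other, other_calls in genotypes.items():
                if other == taxon:
                    continue
                d = window_distance(target, other_calls[start:end])
                if d is not None and (best_d is None or d < best_d):
                    best, best_d = other, d
            if best is None:
                continue
            donor = genotypes[best][start:end]
            for i, call in enumerate(target):
                if call == MISSING and donor[i] != MISSING:
                    result[taxon][start + i] = donor[i]
    return {taxon: "".join(calls) for taxon, calls in result.items()}
```

Because the donor is chosen once per window, a recombination breakpoint inside a window is copied from the wrong haplotype on one side, which matches the weakness listed on the slide. The sketch also fills every missing call, so it does not separate sampling gaps from biologically missing regions.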
35 Hidden Markov Model: TASSEL GBS Imputation
- Developed by Peter Bradbury
- Aimed at GBS and biparental populations
- Hidden Markov Model; very accurate at determining boundaries
- Works well on maize NAM inbred lines, and probably others
- AA <> BB error rate AB > AA
- Most problems appear in faulty populations
- Available as a TASSEL 4.0 plugin
36 Mapping all the alleles (TagCallerAgainstAnchor)
- Most maize alleles have no position on the reference map
- Map allele presence (TagsByTaxa) versus an anchor SNP map (HapMap)
- 8.7M alleles were mapped in <24 hours using a 100 CPU cluster
Physical and genetic mapping of 8.7 million GBS alleles
[figure, alleles: genetic and physical agree; genetic and physical disagree; not in physical, genetically mapped; complex mapping or modest power currently; consistent error or evenly repetitive]
Only 29% of alleles are simple: physical and genetic agree. 55% of alleles are easily genetically mappable.
[figure, reads: reads with strong genetic and/or BLAST position; reads with weaker position hypothesis; reads with no hypothesis (error or even repetitive)]
Many complex alleles are rarer, so 71% of alleles are genetically and/or physically interpretable. With more samples and better error models, perhaps 90% will be usable.
37 Using the Presence/Absence Variants
- In species like maize, this is the majority of the data.
- Less subject to sequencing error.
- Need imputation methods to differentiate between missing from sampling and biologically missing.

Future
- Need better integration of whole-genome sequence data with the pipeline.
- Add information on premature cut sites or mutated cut sites.
- Use paired-end read information.
- Full incorporation of presence/absence variants.
- Increase the range of imputation tools and phasing for structured populations.
- Quantitative genotype tools for polyploids/GS.
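The distinction between "missing from sampling" and "biologically missing" on slide 37 can be made concrete with a small Bayes calculation. The sketch below assumes that read counts for a tag that is truly present follow a Poisson distribution with a known mean depth, and that a prior probability of biological absence is available. Both are illustrative assumptions, as is the function name `prob_biologically_absent`; this is not a method from the pipeline.

```python
import math


def prob_biologically_absent(mean_depth, prior_absent=0.5):
    """Posterior probability that a tag with zero reads is truly absent.

    mean_depth: assumed expected read count for the tag if it is present.
    prior_absent: assumed prior probability that the tag is absent.
    """
    # Under a Poisson model, a present tag yields zero reads with
    # probability exp(-mean_depth).
    p_zero_if_present = math.exp(-mean_depth)
    return prior_absent / (
        prior_absent + (1 - prior_absent) * p_zero_if_present)


low_depth = prob_biologically_absent(0.5)
high_depth = prob_biologically_absent(3.0)
```

At low depth, a zero count says little about whether the tag is absent, which is why imputation from related taxa is needed before presence/absence variants can be used directly.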