Whole Genome Alignment. Adam Phillippy University of Maryland, Fall 2012

Similar documents
Bacterial Genetics & Operons

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

Microbes and you ON THE LATEST HUMAN MICROBIOME DISCOVERIES, COMPUTATIONAL QUESTIONS AND SOME SOLUTIONS. Elizabeth Tseng

Genetic Variation: The genetic substrate for natural selection. Horizontal Gene Transfer. General Principles 10/2/17.

Comparative genomics: Overview & Tools + MUMmer algorithm

Whole Genome Alignments and Synteny Maps

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Genômica comparativa. João Carlos Setubal IQ-USP outubro /5/2012 J. C. Setubal

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Sequence analysis and Genomics

Pan-genomics: theory & practice

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

The two daughter cells are genetically identical to each other and the parent cell.

Algorithms in Bioinformatics

Pairwise & Multiple sequence alignments

Introduction to Sequence Alignment. Manpreet S. Katari

Comparative Genomics Background and Strategies. Nitya Sharma, Emily Rogers, Kanika Arora, Zhiming Zhao, Yun Gyeong Lee

Taxonomy. Content. How to determine & classify a species. Phylogeny and evolution

BACTERIA AND ARCHAEA 10/15/2012

Outline. I. Methods. II. Preliminary Results. A. Phylogeny Methods B. Whole Genome Methods C. Horizontal Gene Transfer

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Outline Classes of diversity measures. Species Divergence and the Measurement of Microbial Diversity. How do we describe and compare diversity?

Molecular evolution - Part 1. Pawan Dhar BII

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Microbial Taxonomy. Microbes usually have few distinguishing properties that relate them, so a hierarchical taxonomy mainly has not been possible.

Biology. Revisiting Booklet. 6. Inheritance, Variation and Evolution. Name:

Vital Statistics Derived from Complete Genome Sequencing (for E. coli MG1655)

pglo/amp R Bacterial Transformation Lab

1. The diagram below shows two processes (A and B) involved in sexual reproduction in plants and animals.

A DNA Sequence 2017/12/6 1

Lecture 4: September 19

An Adaptive Association Test for Microbiome Data

Chapter 27: Bacteria and Archaea

CHAPTER : Prokaryotic Genetics

Bio 119 Bacterial Genomics 6/26/10

Microbiota: Its Evolution and Essence. Hsin-Jung Joyce Wu "Microbiota and man: the story about us

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

9/8/2017. Bacteria and Archaea. Three domain system: The present tree of life. Structural and functional adaptations contribute to prokaryotic success

Sequence Database Search Techniques I: Blast and PatternHunter tools

Biology 105/Summer Bacterial Genetics 8/12/ Bacterial Genomes p Gene Transfer Mechanisms in Bacteria p.

Microbial Genetics, Mutation and Repair. 2. State the function of Rec A proteins in homologous genetic recombination.

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

Lecture 5,6 Local sequence alignment

MiGA: The Microbial Genome Atlas

Model plants and their Role in genetic manipulation. Mitesh Shrestha

Microbial Taxonomy and the Evolution of Diversity

Motivating the need for optimal sequence alignments...

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Outline. Viruses, Bacteria, and Archaea. Viruses Structure Classification Reproduction Prokaryotes Structure Reproduction Nutrition Bacteria Archaea

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Warm-Up. Illustrate (via model, diagram, cartoon, et cetera) how viral replication introduces genetic variation in the viral population. (LO 3.

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

Sequence Alignment Techniques and Their Uses

Lecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) Scribe: John Ekins

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Molecular Evolution & the Origin of Variation

Molecular Evolution & the Origin of Variation

Inferring positional homologs with common intervals of sequences

Chapters AP Biology Objectives. Objectives: You should know...

The minimal prokaryotic genome. The minimal prokaryotic genome. The minimal prokaryotic genome. The minimal prokaryotic genome

The Role of the Horizontal Gene Pool and Lateral Gene Transfer in Enhancing Microbial Activities in Marine Sediments

Sequence Alignment (chapter 6)

Kingdom Monera - The Bacteria

Microbes usually have few distinguishing properties that relate them, so a hierarchical taxonomy mainly has not been possible.

Microbial Taxonomy. Slowly evolving molecules (e.g., rrna) used for large-scale structure; "fast- clock" molecules for fine-structure.

What can sequences tell us?

RGP finder: prediction of Genomic Islands

Ledyard Public Schools Science Curriculum. Biology. Level-2. Instructional Council Approval June 1, 2005

Horizontal transfer and pathogenicity

L3: Blast: Keyword match basics

Collected Works of Charles Dickens

Introduction to Evolutionary Concepts

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

7 Multiple Genome Alignment

Supplementary Information

Sequence Alignment. Johannes Starlinger

Dr. Amira A. AL-Hosary

that does not happen during mitosis?

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

SPECIES OF ARCHAEA ARE MORE CLOSELY RELATED TO EUKARYOTES THAN ARE SPECIES OF PROKARYOTES.

Bioinformatics Algorithms. Physical Mapping Restriction Mapping

EECS730: Introduction to Bioinformatics

Section 19 1 Bacteria (pages )

Frontiers in Microbiology

Gibbs Sampling Methods for Multiple Sequence Alignment

Chapter 5. Evolution of Biodiversity

Bioinformatics and BLAST

Chapter 19. Microbial Taxonomy

Essentiality in B. subtilis

Science at the University of Auckland. Presented by igem

AP Bio Module 16: Bacterial Genetics and Operons, Student Learning Guide

GSBHSRSBRSRRk IZTI/^Q. LlML. I Iv^O IV I I I FROM GENES TO GENOMES ^^^H*" ^^^^J*^ ill! BQPIP. illt. goidbkc. itip31. li4»twlil FIFTH EDITION

Map of AP-Aligned Bio-Rad Kits with Learning Objectives

A Phylogenetic Network Construction due to Constrained Recombination

Supplementary Information

Microbes And Evolution: The World That Darwin Never Saw

Transcription:

Whole Genome Alignment Adam Phillippy University of Maryland, Fall 2012

Motivation cancergenome.nih.gov

Breast cancer karyotypes www.path.cam.ac.uk

Goal of whole-genome alignment } For two genomes, A and B, find a mapping from each position in A to its corresponding position in B CCGGTAGGATATTAAACGGGGTGAGGAGCGTTGGCATAGCA CCGCTAGGCTATTAAAACCCCGGAGGAG...GGCTGAGCA Megabase-sized sequences cannot be aligned with an O(n 2 ) algorithm like dynamic programming.

Global vs. Local alignments } Global pairwise alignment...aagcttggcttagctgctagggtaggcttggg......aagctgggcttagttgctag..taggctttgg... ^ ^ ^^ ^ } Whole genome alignment Recently had a global alignment project? What about local alignment?

Alignment Visualization

Global visualization } Gene model conservation across 3 Plasmodium species

Genome alignment visualization } How can we visualize whole genome alignments? } With an alignment dot plot } N x M matrix } Let i = position in genome A } Let j = position in genome B } Fill cell (i,j) if A i shows similarity to B j A C G T A C C T } A perfect alignment between A and B would completely fill the positive diagonal What does this remind you of?

B Translocation Inversion Insertion A B A http://mummer.sourceforge.net/manual/alignmenttypes.pdf

The look similar, what about their genomes?

Drosophila shuffling Evolution of genes and genomes on the Drosophila phylogeny. Drosophila 12 Genomes Consortium. Nature. 2007 Nov 8;450(7167):203-18.

Multiple alignment visualization Open problem for many genomes

MUMmer Aligning two genomes in under a minute

Nucmer algorithm 1. Find exact match seeds (MUMmer Suffix Tree) 2. Cluster significant matches (Union-Find) 3. Extend and combine alignments (Smith-Waterman) 4. Filter repeats (Dynamic programming) Undergrad summer research project

Suffix trees } O(n) construction } O(n) space } O(n+m) Longest common substring } O(n+m+k) Find all k maximal matches

MUMmer } Maximal Unique Matcher (MUM) } match } exact match of a minimum length } maximal } cannot be extended in either direction without a mismatch } unique } occurs only once in both sequences (MUM) } occurs only once in a single sequence (MAM) } occurs one or more times in either sequence (MEM)

Is it a MEM, MAM or MUM? R MUM : maximal unique match MAM : maximal almost-unique match MEM : maximal exact match Q MUMs inherently avoid repetitive regions, which do not make good seeds.

Clustering cluster length = m i" gap distance = C" indel difference = B A " R m1 m2 m3 A Q B C

0 G T C G T ^ G A C G T T ^ 2 1 1 2 0 2 3* 1 1 3* 4 2 1 2 4 5 3 2 1 3 5 4 2 2 2 4 6 3 3* 1 3 5 2 2 2 4 3 2 3 3* 3* Banded dynamic programming Linear time and space dynamic programming, also see Divide and Conquer Match score 0, Edit score +1, Max edits 2"

Extending Match score +3, Edit score -7" R break point = B" B score ~70% Q A break length = A" 7 matches and 3 edits over a 10 bp window = 7*3-3*7 = 0

L. monocytogenes alignment 18-mer seeds alignments Why isn t it a single diagonal?

Microbial Genomics

Comparative genomics } Study genomic content and function across different taxa } Why? } study evolution } link phenotype with genotype } reveal genomic organization and function } transfer functional annotation } How? } genome sequencing and alignment Observation and comparison yield tremendous insight.

Microbes are underappreciated } They re everywhere } Harmful } disease, spoiling } Beneficial } human microbiota } bio-energy, bio-remediation } synthetic genomics } Easy to work with } rapid generation time } small genomes } extremely efficient } simpler models 10 14 bacterial cells vs. 10 13 human cells

Listeria monocytogenes } Listeria monocytogenes } Important foodborne pathogen (cheese, lunch meat, etc.) } 3 Mbp genome, 3 primary lineages (I,II,III*) Julie Theriot, Stanford

LII LI LIII spp. bile HGT bile resistance genes

Bacteria have sex } A few mechanisms of horizontal gene transfer } Transformation: the genetic alteration of a cell resulting from the introduction, uptake, and expression of foreign genetic material (DNA or RNA). } Transduction: the process in which bacterial DNA is moved from one bacterium to another by a virus (e.g. phage) } Bacterial conjugation: a process in which genetic material is transferred to another cell by cell-to-cell contact. Wikipedia

Pan-genomics } Core genome } minimal gene set necessary for survival } defining characteristics of the species } orthologs, gene groups } Accessory genome } mediate adaptation to different environments } e.g. stress and antibiotic resistance, nutrient metabolism } Pan genome } union of core and accessory genes (non-redundant) } defines total genetic diversity of the species

How big is a pan-genome? } How many new genes will be discovered in sequencing the k th genome? } For all k! possible permutations of k genomes } how many new genes are found in the k th genome? } Perform regression on the average values FOR k = 1 to N FOR each random sample Randomly generate an ordered set of k genomes Compute # unique genes in the k th genome END FOR END FOR

L. monocytogenes pan genome } Power law non-linear least squares fit to means n New Genes 0 100 200 300 400 500 means 346N 0.85 n Pan Genes 2500 3000 3500 4000 4500 means 2850N 0.12 5 10 15 20 N Genomes 5 10 15 20 N Genomes Heap s law. Open pan-genome? Consequences for antibiotic resistance?

L. monocytogenes core genome } Exponential decay non-linear least squares fit to means n Core Genes 2300 2400 2500 2600 2700 2800 2900 means 433e ( 0.35N) + 2456 5.3N 427e ( 0.16N) + 2330 0 5 10 15 20 25 30 N Genomes Draft genomes missing ~5 genes per genome