Comparative genomics: Overview & Tools + MUMmer algorithm

Urmila Kulkarni-Kale
Bioinformatics Centre, University of Pune, Pune 411 007
urmila@bioinfo.ernet.in

Genome sequence: fact file
- 1995: the first complete genome sequence of Haemophilus influenzae Rd was published
- Biological systems are dynamic and evolving
- The fourth dimension: time
- A genome sequence is a snapshot of evolution
- The correlation between phenotypic properties and genomic regions is not straightforward, because phenotypic properties result from many-to-many interactions

Genomes: the current status (GOLD database)
Published complete genomes: 403
Ongoing:
» Archaeal: 81
» Bacterial: 1226
» Eukaryal: 169
» Archaeal: 107
» Prokaryotic: 3478
» Eukaryotic: 1209
Metagenomics: 203
Viral: >4500
As of [date not recovered]

Genome databases
Genomes at NCBI, EBI, TIGR

H. influenzae complete genome

Function information clock of E. coli
Generated in March 2004

Comparison of the coding regions
Begins with gene identification: inferring which portions of the genomic sequence code for genes. There are four basic approaches.

Knowledge of the full genome sequence: solutions or new questions?
Correct number of genes? We are still struggling with the gene counts.

Genome analyses
Variation in:
- Genome size: E. coli 4.6 Mbp; M. pneumoniae 0.81 Mbp; B. subtilis 4.20 Mbp
- GC content: B. burgdorferi 29%; M. tuberculosis 68%
- Codon usage
- Amino acid composition: G, A, P, R are GC rich; I, F, Y, M, D are AT rich
- Genome organisation: single circular chromosomes; linear chromosome + extrachromosomal elements
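
GC content, one of the genome-level statistics listed above, is simple to compute from a sequence. The following is a minimal sketch; the function name `gc_content` and the toy sequences are illustrative assumptions, not part of the original slides.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the A, C, G, T bases of a sequence.

    Ambiguous symbols such as N are ignored (an assumption of this sketch).
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


# Toy sequences, not real genome data.
print(gc_content("ATGC"))      # 0.5
print(gc_content("ggccNNat"))  # 4 of 6 counted bases are G or C
```

Applied to a whole genome read from a FASTA file, the same function would give values comparable to the percentages quoted on the slide.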

CG: comparisons between genomes
- Strains of the same species
- Closely related species
- Distantly related species
- List of orthologs
- Evolution of individual genes
- Evolution of organisms

CG helps to ask some interesting questions
Identifying similarities and differences between genomes may allow us to understand:
- How did two organisms evolve?
- Why do certain bacteria cause disease while others do not?
- Identification and prioritization of drug targets

CG: unit of comparison
Unit of comparison: gene/genome
- Number
- Content (sequence)
- Location (map position)
- Gene order
- Gene cluster (genes that are part of a known metabolic pathway are found to exist as a group)
Colinearity of gene order is referred to as synteny.
A conserved group of genes in the same order in two genomes is called a syntenic group or syntenic cluster.
Translocation: movement of a genomic segment from one position to another.

Structure of the tryptophan operon (Dandekar et al., 1998)
Figure legend:
- Numbers: gene number
- Arrows: direction of transcription
- //: dispersion of the operon by 50 genes
- Domain fusion: trpD and trpG; trpF and trpC; trpB and trpA
- Genetically linked separate genes

Important observations with regard to gene order
- Order is highly conserved in closely related species but gets changed by rearrangements
- With greater evolutionary distance, there is no correspondence between the gene order of orthologous genes
- Groups of genes having similar biochemical function tend to remain localized
- Example: genes required for synthesis of tryptophan (trp genes) in E. coli and other prokaryotes

Synteny
Refers to regions of two genomes that show considerable similarity in sequence and in conserved gene order, and that are likely to be related by common descent.
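
Conserved gene order can be illustrated with a small sketch that finds runs of genes appearing consecutively and in the same order in two genomes. This is a simplified illustration, not the method of any particular tool: the function name `collinear_blocks`, the gene labels and both gene orders are hypothetical, and strand and inversions are ignored.

```python
def collinear_blocks(order_a, order_b, min_genes=2):
    """Return runs of genes that are adjacent and in the same order in both genomes.

    order_a, order_b: lists of ortholog identifiers in chromosomal order.
    Assumes each identifier occurs at most once per genome.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        extends = (
            bool(current)
            and gene in pos_b
            and pos_b[gene] == pos_b[current[-1]] + 1
        )
        if extends:
            current.append(gene)
            continue
        # The run is broken: keep it if it is long enough, then start again.
        if len(current) >= min_genes:
            blocks.append(current)
        current = [gene] if gene in pos_b else []
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


# Hypothetical gene orders: g7 has moved and g9 is inserted in genome B.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
genome_b = ["g7", "g1", "g2", "g3", "g9", "g4", "g5", "g6"]
print(collinear_blocks(genome_a, genome_b))
# [['g1', 'g2', 'g3'], ['g4', 'g5', 'g6']]
```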

COGs: phylogenetic classification of proteins encoded in complete genomes

Genome analyses at NCBI
Pairwise genome comparison of protein homologs (symmetrical best hits)
http://www.ncbi.nlm.nih.gov/sutils/geneplot.cgi
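
The idea of symmetrical best hits can be sketched as follows: a pair of proteins is kept only if each is the other's best-scoring match. This is a minimal illustration under stated assumptions, not the NCBI implementation: the function name, the protein labels and the scores are hypothetical, one similarity score per pair is assumed, and ties are not handled.

```python
def symmetrical_best_hits(scores):
    """Return protein pairs (a, b) where a's best hit is b and b's best hit is a.

    scores: dict mapping (protein_in_A, protein_in_B) -> similarity score.
    """
    best_for_a = {}  # protein in A -> (score, best protein in B)
    best_for_b = {}  # protein in B -> (score, best protein in A)
    for (prot_a, prot_b), score in scores.items():
        if prot_a not in best_for_a or score > best_for_a[prot_a][0]:
            best_for_a[prot_a] = (score, prot_b)
        if prot_b not in best_for_b or score > best_for_b[prot_b][0]:
            best_for_b[prot_b] = (score, prot_a)
    return sorted(
        (prot_a, prot_b)
        for prot_a, (_, prot_b) in best_for_a.items()
        if best_for_b[prot_b][1] == prot_a
    )


# Hypothetical scores: a2 also matches b1, but b1 prefers a1.
example_scores = {
    ("a1", "b1"): 200, ("a1", "b2"): 50,
    ("a2", "b1"): 180, ("a2", "b2"): 40,
    ("a3", "b3"): 90,
}
print(symmetrical_best_hits(example_scores))
# [('a1', 'b1'), ('a3', 'b3')]
```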

Integr8: CG site at EBI
http://www.ebi.ac.uk/integr8

Comparative genomics tools
- BLAST2
- MUMmer
- PipMaker
- AVID/VISTA
Comparisons and analyses at both the nucleic acid and protein levels

BLAST2
- Available at NCBI
- Input: GI or FASTA sequence (a range can be specified)
- Output: graphical alignment of 2 genomes

Genome alignment algorithm: MUMmer
- Developed by Dr. Steven Salzberg's group at TIGR
- NAR (1999) 27:2369-2376
- NAR (2002) 30:2478-2483
- Availability: free, from the TIGR site

Features of MUMmer
- The algorithm assumes that the sequences are closely related
- Can quickly compare millions of bases
- Outputs:
  - Base-to-base alignment
  - Highlights the exact matches and differences in the genomes
  - Locates SNPs, large inserts, significant repeats, tandem repeats and reversals

Definitions are drawn from biology
- SNP: a single mutation surrounded by two matching regions
- Regions of DNA where the 2 sequences have diverged by more than one SNP
- Large inserts: regions inserted into one of the genomes (sequence reversals, lateral gene transfer)
- Repeats: the form of duplication that has occurred in either genome
- Tandem repeats: regions of repeated DNA in immediate succession but with different copy number in different genomes. The copy number need not be whole: a repeat can occur 2.5 times

Techniques used in the MUMmer algorithm
- Compute suffix trees for every genome
- Longest Increasing Subsequence (LIS)
- Alignment using the Smith & Waterman algorithm
- Integration of these techniques for genome alignment
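
Of the techniques listed, Smith & Waterman local alignment is the easiest to show in a few lines. The sketch below computes only the best local alignment score by dynamic programming; the scoring values (match +2, mismatch -1, linear gap -2) and the function name are assumptions chosen for illustration, not MUMmer's actual parameters.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the DP matrix.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTCC"))  # 8: the shared ACGT, four matches
```

Because the cost grows with the product of the two lengths, this kind of alignment is practical only on short stretches, such as the gaps left between exact matches.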

MUMmer: steps in the alignment process
- Read two genomes
- Find the Maximal Unique Matches (MUMs) of the two genomes
- Sort and order the MUMs using LIS
- Close the gaps in the alignment, using SNPs, mutation regions, repeats and tandem repeats
- Output alignment: MUMs plus the regions that do not match exactly

MUMmer steps
- Locating MUMs
- Sorting MUMs
- Closure with gaps
G1: ACTGATTACGTGAACTGGATCCA
G2: ACTCTAGGTGAAGTGATCCA

Genome1: ACTGATTACGTGAACTGGATCCA
Genome2: ACTCTAGGTGAAGTGATCCA

Alignment after closing the gaps:
ACTGATTACGTGAACTGGATCCA
ACTC--TAGGTGAAGT-GATCCA
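
The gapped alignment of the two example genomes can be inspected column by column to list where the sequences differ. The helper below is an illustrative sketch (the name `compare_columns` is an assumption); it reports mismatch columns and gap columns separately and uses 0-based positions.

```python
def compare_columns(row1, row2):
    """Return (mismatch_positions, gap_positions) for two aligned rows.

    Both rows must have the same length; '-' marks a gap.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    mismatches, gaps = [], []
    for position, (x, y) in enumerate(zip(row1, row2)):
        if x == "-" or y == "-":
            gaps.append(position)
        elif x != y:
            mismatches.append(position)
    return mismatches, gaps


# The aligned rows from the slide.
top = "ACTGATTACGTGAACTGGATCCA"
bottom = "ACTC--TAGGTGAAGT-GATCCA"
print(compare_columns(top, bottom))
# ([3, 8, 14], [4, 5, 16])
```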

What is a MUM?
- A MUM is a subsequence that occurs exactly once in both genomes and is not part of any longer such match
- The two characters that bound a MUM are always mismatches
GenA: tcgatcgacgatcgccgccgtagatcgaataacgagagagcataacgactta
GenB: gcattagacgatcgccgccgtagatcgaataacgagagagcataatccagag
Principle: if a long matching sequence occurs exactly once in each genome, it is almost certainly part of the global alignment
Similar to BLAST & FASTA
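
The definition can be checked on the GenA/GenB example with a brute-force search. This sketch is deliberately naive (quadratic time, no suffix tree), so it illustrates what a MUM is rather than how MUMmer finds one; the function names and the `min_len` threshold of 20 are assumptions.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=20):
    """Return (start_in_a, start_in_b, length) for each maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums


gen_a = "tcgatcgacgatcgccgccgtagatcgaataacgagagagcataacgactta"
gen_b = "gcattagacgatcgccgccgtagatcgaataacgagagagcataatccagag"
print(find_mums(gen_a, gen_b))
# [(6, 6, 39)]
```

The single MUM starts at 0-based position 6 in both sequences and is 39 bases long; the characters on either side of it (c/a on the left, c/t on the right) are mismatches, as the slide states.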

Sorting & ordering MUMs
- MUMs are sorted according to their position in Genome A
- The order of the matching MUMs in Genome B is then considered
- In the figure: MUM 3 is a random match or an inexact repeat; MUM 5 is a transposition
- The LIS algorithm locates the longest set of MUMs that occur in ascending order in both genomes
- This leads to the global MUM alignment
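
The ordering step can be sketched with a standard longest increasing subsequence routine. MUMs are numbered by their position in Genome A, and the input is the order in which those numbers appear in Genome B. The order used below is a hypothetical example in which MUM 3 and MUM 5 are out of place; the implementation is a generic O(n log n) LIS, not MUMmer's own code.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return one longest strictly increasing subsequence of values."""
    tail_values = []   # tail_values[k]: smallest tail of a subsequence of length k + 1
    tail_indices = []  # index in values of each of those tails
    previous = [-1] * len(values)
    for index, value in enumerate(values):
        k = bisect_left(tail_values, value)
        if k > 0:
            previous[index] = tail_indices[k - 1]
        if k == len(tail_values):
            tail_values.append(value)
            tail_indices.append(index)
        else:
            tail_values[k] = value
            tail_indices[k] = index
    # Walk back from the tail of the longest subsequence.
    result = []
    index = tail_indices[-1] if tail_indices else -1
    while index != -1:
        result.append(values[index])
        index = previous[index]
    return result[::-1]


# Hypothetical order of MUMs 1..7 as they appear along Genome B.
order_in_b = [1, 3, 2, 4, 6, 7, 5]
print(longest_increasing_subsequence(order_in_b))
# [1, 2, 4, 6, 7]
```

The MUMs left out of the result (3 and 5 here) are the ones treated as random matches, repeats or transpositions rather than as anchors of the global alignment.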

MUMmer results
2 strains of M. tuberculosis: H37Rv and CDC1551
Genome size: 4 Mb
Total time: 55 s
- Generating suffix tree: 5 s
- Sorting MUMs: 45 s
- S&W alignment: 5 s

Alignment of M. tuberculosis strains CDC1551 (top) and H37Rv (bottom)
Single green lines indicate SNPs; blue lines indicate insertions

Comparison of 2 Mycoplasma genomes
- Cousins that are distantly related
- M. genitalium: 580 074 nt
- M. pneumoniae: 816 394 nt (about 236 000 nt longer)
- Analysis of the proteins tells us that all M.g. proteins are present in M.p.
Alignment was carried out using:
- FASTA (dividing each genome into 1000 bp segments), all-against-all searches
- Fixed-length patterns (length = 25)
- MUMmer (length = 25)

Comparison of 2 Mycoplasma genomes (figure panels)
- Using FASTA
- Fixed-length patterns: 25-mers
- MUMmer
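
The fixed-length pattern approach can be sketched as an exact k-mer lookup: index every k-mer of one genome, then scan the other and record the coordinates of each shared k-mer. The function name and the toy sequences are assumptions; k = 25 follows the slide, while the toy call uses k = 4 so that the result is easy to follow.

```python
from collections import defaultdict


def shared_kmer_hits(a, b, k=25):
    """Return sorted (position_in_a, position_in_b) pairs of identical k-mers."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    hits = []
    for j in range(len(b) - k + 1):
        # .get avoids creating empty entries for k-mers absent from a.
        for i in index.get(b[j:j + k], []):
            hits.append((i, j))
    return sorted(hits)


# Toy sequences, not Mycoplasma data.
print(shared_kmer_hits("ACGTACGT", "TTACGTAA", k=4))
# [(0, 2), (1, 3), (3, 1), (4, 2)]
```

Plotting such coordinate pairs gives a dot plot. Unlike MUMs, fixed-length hits are neither required to be unique nor extended to their maximal length, so repeated regions produce many hits.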

Post-sequencing challenges
- Genome sequencing is just the beginning of appreciating biocomplexity
- Sequence-based function assignment approaches fail as sequence similarity drops
- Structure-based function prediction approaches are limited by the availability of structures and by the association between structural motifs and functional descriptors
- As a result, in any genome:
  - Genes with known function: ~40%
  - Genes with unknown function: ~60%