Multiple sequence alignments

Special thanks to all the scientists who made their presentations publicly available on the web; many of the slides used to prepare this presentation were taken from them. Figures are linked to their corresponding web sites.

Web sites used in our practice: Sequence Retrieval System, RSA Tools, ClustalW, BLAST.

What is a Multiple Sequence Alignment?

Structural criteria: residues are arranged so that those playing a similar role end up in the same column.
Evolutionary criteria: residues are arranged so that those having the same ancestor end up in the same column.
Similarity criteria: as many similar residues as possible end up in the same column.

Enrique Merino, IBT-UNAM

Multiple sequence alignments

[Figure label: 2,000,000,000 years]

Multiple alignment seems a simple extension of pairwise alignment: align k sequences at the same time.

    AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
    AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

[Figure: alignment lattice with axes x, y and z]

Unfortunately, this can get very expensive. For more than eight proteins of average length, the exact problem cannot be solved in practice with current computer power. Therefore, all of the methods capable of handling larger problems in practical timescales make use of heuristics.

Aligning N sequences of length L requires a matrix of size $L^N$, where each cell in the matrix has $2^N - 1$ neighbors. This gives a total time complexity of $O(2^N L^N)$.

What is a Multiple Sequence Alignment?

The MSA contains what you put inside. You can view your MSA as:
a record of evolution
a summary of a protein family
a collection of experiments made for you by Nature

An MSA is a MODEL.
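To get a feel for how quickly the exact approach becomes impractical, the following Python sketch evaluates the lattice size $L^N$ and the estimate $(2^N - 1) L^N$ from the complexity argument above. The function names are illustrative, and the numbers are order-of-magnitude estimates rather than measured running times.

```python
def lattice_cells(n_sequences: int, length: int) -> int:
    """Number of cells in the N-dimensional dynamic programming lattice (L^N)."""
    return length ** n_sequences


def neighbor_lookups(n_sequences: int, length: int) -> int:
    """Each cell looks at 2^N - 1 neighbors, giving (2^N - 1) * L^N lookups."""
    return (2 ** n_sequences - 1) * lattice_cells(n_sequences, length)


if __name__ == "__main__":
    # Assumed sequence length of 300 residues, chosen only for illustration.
    for n in (2, 3, 4, 8):
        print(n, lattice_cells(n, 300), neighbor_lookups(n, 300))
```

Running the loop shows how adding a single sequence multiplies the work by roughly twice the sequence length, which is why the programs described in the rest of the presentation rely on heuristics.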

What Is A Multiple Sequence Alignment?

It indicates the RELATIONSHIP between residues of different sequences.
It REVEALS similarities and inconsistencies.
Multiple alignments are CENTRAL to MOST bioinformatics techniques.

    chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
    wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
    trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
    mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
          ***. :::.:... :.. *. *: *

Multiple Alignments: What Are They Good For?

Why Is It Difficult To Compute A Multiple Sequence Alignment?

A CROSSROAD PROBLEM
BIOLOGY: what is A good alignment?
COMPUTATION: what is THE good alignment?

CIRCULAR PROBLEM: good sequences and a good alignment each depend on the other.

Multiple Sequence Alignment: Derived Information

    RWDAGCVN
    RWDSGCVN
    RWHHGCVQ
    RWKGACYN
    RWLWACEQ

From an aligned block such as this one can derive:
a motif
frequency (identity) matrices
a fingerprint
a profile (gapped weight matrix)
position-specific weight matrices (blocks)
a hidden Markov model (HMM)
a regular expression (pattern): R-W-x(2)-[AG]-C-x-[NQ]

Multiple Sequence Alignment: classification

Simultaneous, as opposed to progressive. Simultaneous methods use all the information at the same time.
Exact, as opposed to heuristic. Heuristics cut corners (like BLAST versus Smith-Waterman) and do not guarantee an optimal solution.
Stochastic, as opposed to deterministic. Stochastic methods contain an element of randomness (example: a Monte Carlo surface estimation).
Iterative, as opposed to non-iterative. Iterative methods run the same algorithm many times; most stochastic methods are iterative.

The Correct Alignment

                                              Exhaustive methods   Heuristic methods
    Correct according to optimality criteria  Always               Not always
    Correct according to homology             Not always           Not always

[Figure: classification of MSA programs as simultaneous, iterative, stochastic (GAs, HMMs) or non tree based: MSA, DCA, Combalign, OMA, POA, Clustal, T-Coffee, Praline, MAFFT, Dialign, Iteralign, Prrp, SAM, HMMer, SAGA, GA]
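As an illustration of the regular expression derived from the block above, the following Python sketch converts the pattern R-W-x(2)-[AG]-C-x-[NQ] into the syntax of Python's `re` module and checks it against the five aligned sequences. The converter handles only the elements that occur in this one pattern (single residues, bracketed sets, `x` and `x(n)`); it is an assumption-laden helper, not a general pattern parser.

```python
import re

PATTERN = "R-W-x(2)-[AG]-C-x-[NQ]"
BLOCK = ["RWDAGCVN", "RWDSGCVN", "RWHHGCVQ", "RWKGACYN", "RWLWACEQ"]


def pattern_to_regex(pattern: str) -> str:
    """Translate a dash-separated pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        wildcard = re.fullmatch(r"x(?:\((\d+)\))?", element)
        if wildcard is None:
            parts.append(element)           # literal residue or [set]
        elif wildcard.group(1) is None:
            parts.append(".")               # x: any single residue
        else:
            parts.append(".{%s}" % wildcard.group(1))  # x(n): any n residues
    return "".join(parts)


if __name__ == "__main__":
    regex = pattern_to_regex(PATTERN)
    for sequence in BLOCK:
        print(sequence, bool(re.fullmatch(regex, sequence)))
```

A pattern keeps only which residues are allowed at each position; unlike a profile or an HMM, it discards how often each residue occurs.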

[Figure: classification of MSA programs as simultaneous, iterative or stochastic: DCA, Clustal, MSA, Combalign, T-Coffee, POA, Iteralign, Prrp, OMA, GA, SAM, HMMer, Praline, MAFFT, SAGA, Dialign]

In any case, MSA methods consider the evolution of each column as an independent process. How close to reality is this assumption?

3D protein models can be evaluated based on the co-evolution of their interacting residues.

The presence of "correlated positions" between pairs of positions in pairs of multiple sequence alignments can be used in predicting intra-protein and protein-protein interactions.

[Figure: correlated positions between the alignments of proteins A and B]

Multiple sequence alignments. Clustal W

Julie D. Thompson, Desmond G. Higgins and Toby J. Gibson. Nucleic Acids Research, 1994, Vol. 22, No. 22, 4673-4680.

Step 1. Pairwise alignment. Compare each sequence with every other sequence and calculate a distance matrix.

    human EYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKASLPNLSHVTEN 480
    dog   EYSGSSEKIDLMASDPQDAFICESERVHTKPVGGNIEDKIFGKTYRRKASLPKVSHTTEV 477
    mouse GGFSSSRKTDLVTPDPHHTLMCKSGRDFSKPVEDNISDKIFGKSYQRKGSRPHLNHVTE  476

    SeqA  Name   Len(aa)  SeqB  Name   Len(aa)  Identity
    1     human  60       2     dog    60       77%
    1     human  60       3     mouse  59       61%
    2     dog    60       3     mouse  59       52%

Matrix for the three different sequences (H = human, D = dog, M = mouse):

         H     D     M
    H    -
    D    0.76  -
    M    0.61  0.52  -

Each entry is the number of exact matches divided by the sequence length (ignoring gaps). Although the slide calls this a distance, the value is a fractional identity: the higher the number, the more closely related the two sequences are. In this matrix, the sequence of human is 76% identical to the sequence of dog (46 matches out of 60 positions, which the table above rounds to 77%).
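The matrix entries can be reproduced with a few lines of Python. This is a minimal sketch, assuming the two sequences are already aligned to the same length and that "ignoring gaps" means skipping every column in which either sequence has a gap; the function name is illustrative and is not part of Clustal W. The human and dog fragments from the slide have equal length and no gaps, so they can be compared position by position.

```python
HUMAN = "EYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKASLPNLSHVTEN"
DOG = "EYSGSSEKIDLMASDPQDAFICESERVHTKPVGGNIEDKIFGKTYRRKASLPKVSHTTEV"


def fractional_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Exact matches divided by the number of compared (gap-free) columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue                    # assumption: gap columns are skipped
        compared += 1
        matches += a == b
    return matches / compared if compared else 0.0


if __name__ == "__main__":
    print(round(fractional_identity(HUMAN, DOG), 2))
```

Filling the whole matrix means calling the function once for every pair of sequences, after each pair has been aligned.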

Multiple sequence alignments. Clustal W

Step 2. Create guide tree. Use the results of the distance matrix to create a guide tree that determines the order in which the sequences will be aligned.

         H     D     M
    H    -
    D    0.76  -
    M    0.61  0.52  -

[Figure: guide tree with leaves H, D and M]

Initially the guide trees were calculated using the UPGMA method. The current version uses the Neighbour-Joining method, which gives better estimates of individual branch lengths.

The guide tree, or dendrogram, has no phylogenetic meaning and cannot be used to show evolutionary relationships.

Step 3. Alignment. Follow the guide tree and align the sequences.

1. Align human and dog first.
2. Add the mouse sequence to the previous alignment of human and dog.

Align the most closely related sequences first, then add the more distantly related ones and align them to the existing alignment, inserting gaps if necessary.

By the time the most distantly related sequences are aligned, one already has a sample of aligned sequences that gives important information about the variability at each position.

Gap treatment

Short stretches of 5 hydrophilic residues often indicate loop or random coil regions (not essential for structure), and therefore gap penalties are reduced for such stretches.

Gap penalties for closely related sequences are lowered compared to more distantly related sequences ("once a gap, always a gap" rule). It is thought that those gaps occur in regions that do not disrupt the structure or function.

Alignments of proteins of known structure show that gaps in proteins do not occur more frequently than every eight residues. Therefore gap penalties are increased when a gap would be required within eight residues or less of another gap. This gives a lower alignment score in that region.

A gap weight is assigned after each amino acid according to the frequency with which a gap naturally occurs after that amino acid.

Multiple sequence alignments. Clustal W

Amino acid weight matrices

As we know, there are many scoring matrices that one can use depending on the relatedness of the aligned proteins. As the alignment proceeds to longer branches, the amino acid scoring matrices are changed to accommodate more divergent sequences. The length of the branch is used to determine which matrix to use.

Similar sequences: "hard" matrices (BLOSUM80)
Distant sequences: "soft" matrices (BLOSUM50)

Relative contribution of each pairwise alignment to the global alignment score

Sequences are weighted to compensate for the bias of redundant elements in the alignment.

Flowchart of computation steps in Clustal (http://www.clustal.org/)

Pairwise alignment: calculation of distance matrix
Creation of unrooted Neighbor-Joining tree
Rooted NJ tree (guide tree) and calculation of sequence weights
Alignment following the guide tree

ClustalW: http://www.ch.embnet.org/software/clustalw.html

Multiple sequence alignments. T-Coffee

T-Coffee: mixing local and global alignments

A regular progressive alignment strategy may produce alignment errors.

The global alignments are constructed using ClustalW on the sequences, two at a time.

The local alignments are the ten top-scoring non-intersecting local alignments between each pair of sequences, gathered using the Lalign program of the FASTA package (a variant of the Smith and Waterman method).

[Figure: local and global alignments feed a library, which is extended and used for library-based multiple sequence alignment]

T-Coffee: primary library

In the library, each alignment is represented as a list of pairwise residue matches; each of these pairs is a constraint. Not all of these constraints are equally important. This is taken into account when computing the multiple alignment, and priority is given to the most reliable residue pairs.

T-Coffee: analysis of consistency

We enormously increase the value of the information in the library by examining the consistency of each pair of residues with residue pairs from all of the other alignments. For each pair of aligned residues in the library, we can assign a weight that reflects the degree to which those residues align consistently with residues from the other sequences.

The triplet assumption

[Figure: the triplet assumption for SEQ A and SEQ B; consistency versus consensus; ClustalW versus T-Coffee]
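The consistency idea can be sketched in Python as follows. This is a simplified illustration, not T-Coffee's implementation: the library is a hypothetical dictionary keyed by pairs of (sequence name, residue index), the weights are made-up numbers, and the extension rule used here (for every residue of a third sequence aligned to both members of a pair, add the smaller of the two weights) is an assumption based on the usual description of the triplet approach.

```python
from collections import defaultdict


def extend_library(primary):
    """primary maps frozenset({res_a, res_b}) -> weight; residue = (seq, index)."""
    neighbors = defaultdict(dict)
    for pair, weight in primary.items():
        res_a, res_b = tuple(pair)
        neighbors[res_a][res_b] = weight
        neighbors[res_b][res_a] = weight

    extended = defaultdict(float)
    for pair, weight in primary.items():
        extended[pair] += weight                  # direct evidence
    for x, x_links in neighbors.items():
        for z, w_xz in x_links.items():           # z: residue aligned to x
            for y, w_zy in neighbors[z].items():  # y: residue aligned to z
                # y must come from another sequence than x; x < y counts
                # every unordered pair once per intermediate residue z.
                if y[0] != x[0] and x < y:
                    extended[frozenset((x, y))] += min(w_xz, w_zy)
    return dict(extended)


if __name__ == "__main__":
    library = {
        frozenset({("A", 1), ("B", 1)}): 88,   # hypothetical weights
        frozenset({("A", 1), ("C", 1)}): 77,
        frozenset({("C", 1), ("B", 1)}): 100,
    }
    for pair, weight in extend_library(library).items():
        print(sorted(pair), weight)
```

In this toy library the A/B pair gains support because both residues also align with the same residue of sequence C, so its weight rises from 88 to 165.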

T-Coffee: aligned sequences using the extended library

Dynamic programming using an extended library.

T-Coffee and consistency

Each library line is a soft constraint (a wish).
You can't satisfy them all.
You must satisfy as many as possible (the easy ones).

Mixing heterogeneous data with T-Coffee

[Figure: local alignment, global alignment, specialist and structural sources combined into a multiple sequence alignment]

T-Coffee and consistency (summary)

The method is broadly based on the popular progressive approach to multiple alignment but avoids the most serious pitfalls caused by the greedy nature of this algorithm.

With T-Coffee we pre-process a data set of all pairwise alignments between the sequences. This provides us with a library of alignment information that can be used to guide the progressive alignment.

Intermediate alignments are then based not only on the sequences to be aligned next but also on how all of the sequences align with each other.

This alignment information can be derived from heterogeneous sources such as a mixture of alignment programs and/or structure superposition.

3D-Coffee: why do we want to mix sequences and structures?

Sequences are cheap and common. Structures are expensive and rare.

Cheapest structure determination: sequence-structure alignment. Convincing alignment: same fold.

Distant sequences are hard to align: thread or align?

    ADKPRRP---LS-YMLWLN
    ADKPKRPKPRLSAYMLWLN

3D-Coffee: why do we want to mix sequences and structures?

    chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
    wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
    trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
    mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
          ***. :::.:... :.. *. *: *

Multiple sequence alignments help in exploring the twilight zone.

[Figure: structure superposition]

Conclusion: structures help, BUT NOT SO MUCH.

http://www.tcoffee.org/projects_home_page/t_coffee_home_page.html

http://www.tcoffee.org/

MUSCLE: multiple sequence alignment with reduced time and space complexity

Basic strategy: a progressive alignment is built, to which horizontal refinement is applied.

There are three stages: Draft, Improved and Refinement. At the end of each stage a multiple alignment is available and the algorithm can be terminated.

MUSCLE. Stage 1. Draft

The goal of the first stage is to produce a multiple alignment, emphasizing speed over accuracy.

1.1. Similarity measure and distance estimate

Similarity is calculated using k-mer counting. A k-mer is a contiguous subsequence of length k, also known as a word or k-tuple. Related sequences tend to have more k-mers in common than expected by chance.

    sequence: ACCATGCGAATGGTCCACAATG
    k-mer:    ATG   CCA
    score:    3     2

Based on the pairwise similarities, a triangular distance matrix is computed.

1.2. Tree construction using UPGMA

         1     2     3     4
    1          0.5   0.7   0.3
    2                0.2   0.8
    3                      0.6
    4
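The k-mer counts in the example above can be checked with a short Python sketch. The `kmer_counts` function reproduces the slide's scores; `kmer_similarity` is only one plausible way of turning shared k-mers into a similarity (the fraction of k-mers of the shorter sequence that are shared) and is an assumption, not necessarily the exact measure used by MUSCLE.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every contiguous subsequence of length k (overlapping windows)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_similarity(seq_a: str, seq_b: str, k: int) -> float:
    """Shared k-mers divided by the k-mer count of the shorter sequence."""
    shortest = min(len(seq_a), len(seq_b))
    if shortest < k:
        raise ValueError("both sequences must be at least k residues long")
    shared = kmer_counts(seq_a, k) & kmer_counts(seq_b, k)  # minimum of counts
    return sum(shared.values()) / (shortest - k + 1)


if __name__ == "__main__":
    counts = kmer_counts("ACCATGCGAATGGTCCACAATG", 3)
    print(counts["ATG"], counts["CCA"])
```

No alignment is needed to compute these counts, which is what makes the draft stage fast.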

MUSCLE. Stage 1. Draft

1.2. Tree construction using UPGMA

From the distance matrix we construct a tree using the UPGMA method (Unweighted Pair Group Method with Arithmetic mean).

         1     2     3     4
    1          0.5   0.7   0.3
    2                0.2   0.8
    3                      0.6
    4

[Figure: UPGMA tree with leaf order 1, 4, 2, 3]

UPGMA is one of the fastest tree construction methods. It is a simple agglomerative, or bottom-up, data clustering method.
UPGMA assumes a constant rate of evolution (molecular clock hypothesis).
At each step, the nearest two clusters are combined into a higher-level cluster.
The distance between any two clusters A and B is taken to be the average of all distances between pairs of objects "a" in A and "b" in B.

1.3. Progressive alignment

A progressive alignment is built by following the branching order of the tree, yielding a multiple alignment of all input sequences at the root. The alignment is done by profiles (profile-profile alignment).
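The clustering rule described above can be sketched in Python using the four-sequence distance matrix from the slide. This is a minimal, unoptimized illustration: the representation of the tree as nested tuples and the function name are choices made here, branch lengths are not computed, and it recomputes averages from the leaf distances instead of updating a matrix.

```python
from itertools import combinations

DISTANCES = {
    frozenset((1, 2)): 0.5, frozenset((1, 3)): 0.7, frozenset((1, 4)): 0.3,
    frozenset((2, 3)): 0.2, frozenset((2, 4)): 0.8, frozenset((3, 4)): 0.6,
}


def upgma(leaf_distances):
    """Return (tree, merges); tree is nested tuples, merges lists each join."""
    leaves = sorted({leaf for pair in leaf_distances for leaf in pair})
    clusters = [((leaf,), leaf) for leaf in leaves]   # (member leaves, subtree)
    merges = []

    def average_distance(members_a, members_b):
        # Mean of all leaf-to-leaf distances between the two clusters.
        total = sum(leaf_distances[frozenset((a, b))]
                    for a in members_a for b in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Pick the nearest two clusters.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_distance(clusters[ij[0]][0],
                                                   clusters[ij[1]][0]))
        (members_i, tree_i), (members_j, tree_j) = clusters[i], clusters[j]
        merges.append((tree_i, tree_j, average_distance(members_i, members_j)))
        merged = (members_i + members_j, (tree_i, tree_j))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][1], merges


if __name__ == "__main__":
    tree, merges = upgma(DISTANCES)
    print(tree)
    for left, right, distance in merges:
        print(left, right, round(distance, 2))
```

With this matrix, sequences 2 and 3 are joined first (distance 0.2), then 1 and 4 (distance 0.3), and finally the two pairs, which agrees with the leaf order 1, 4, 2, 3 of the tree on the slide.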

MUSCLE. Stage 1. Draft

1.3. Progressive alignment

A progressive alignment is built by following the branching order of the tree, yielding a multiple alignment of all input sequences at the root. The alignment is done by profile alignment.

MUSCLE. Stage 2. Improved

2.1. Similarity measure and distance estimate

The main source of error in the draft progressive stage is the approximate k-mer distance measure, which results in a suboptimal tree. MUSCLE therefore re-estimates the tree using the Kimura distance, which is more accurate but requires an alignment.
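A minimal Python sketch of a Kimura-style distance is given below. The slides do not state the formula; the expression used here, $d = -\ln(1 - D - D^2/5)$ with $D$ the fraction of differing positions, is the commonly quoted Kimura approximation for protein sequences and is an assumption about what is meant. The function name is illustrative.

```python
import math


def kimura_distance(fractional_identity: float) -> float:
    """Kimura-style corrected distance from the fractional identity of a pair.

    Assumed formula: d = -ln(1 - D - D^2 / 5), with D = 1 - identity.
    """
    differing = 1.0 - fractional_identity
    argument = 1.0 - differing - differing * differing / 5.0
    if argument <= 0.0:
        # The correction is undefined for very divergent pairs.
        raise ValueError("identity too low for the Kimura correction")
    return -math.log(argument)


if __name__ == "__main__":
    for identity in (1.0, 0.9, 0.76, 0.5):
        print(identity, round(kimura_distance(identity), 3))
```

The correction grows faster than the raw fraction of differences because, in divergent sequences, several substitutions may have occurred at the same position.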

MUSCLE. Stage 2. Improved

2.2. Tree construction using UPGMA

A tree is constructed by computing a Kimura distance matrix and applying a clustering method to it.

[Figure: distance matrix for sequences 1 to 4]

2.3. Progressive alignment

A new progressive alignment is built.

[Figure: new alignment built from the re-estimated tree of sequences 1, 2, 3 and 4]

MUSCLE. Stage 2. Improved

2.4. Tree comparison

The new tree is compared to the previous tree by identifying the set of internal nodes for which the branching order has changed. If Stage 2 has executed more than once and the number of changed nodes has not decreased, the process of improving the tree is considered to have converged and iteration terminates.

MUSCLE. Stage 3. Refinement

Refinement is performed iteratively.

3.1. Delete an edge from the tree (choice of bipartition)

An edge is removed from the tree, dividing the sequences into two disjoint subsets.

[Figure: tree of sequences 1 to 5 split into two subtrees by removing an edge]
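The effect of deleting an edge can be sketched in Python. The tree below is a hypothetical five-sequence example written as nested tuples (it is not taken from the slide figure), and the function names are illustrative. Each edge of the tree corresponds to one way of splitting the sequences into two disjoint subsets.

```python
def leaves(tree):
    """Set of leaf labels below a node; internal nodes are tuples."""
    if isinstance(tree, tuple):
        return frozenset().union(*(leaves(child) for child in tree))
    return frozenset([tree])


def bipartitions(tree):
    """All distinct splits {subset, complement} obtained by cutting one edge."""
    all_leaves = leaves(tree)
    splits = set()

    def visit(node, is_root):
        if not is_root:
            inside = leaves(node)           # sequences below the cut edge
            splits.add(frozenset((inside, all_leaves - inside)))
        if isinstance(node, tuple):
            for child in node:
                visit(child, False)

    visit(tree, True)
    return splits


if __name__ == "__main__":
    example_tree = ((1, (2, 3)), (4, 5))    # hypothetical tree
    for split in bipartitions(example_tree):
        print([sorted(side) for side in split])
```

For five sequences this yields seven distinct splits; the two edges below the root of a rooted binary tree give the same split and are counted once.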

MUSCLE. Stage 3. Refinement

3.2. Compute subtree profiles

The multiple alignment of each subset is extracted from the current multiple alignment. Columns made up of indels only are removed.

Current alignment:

    TCC--AA
    TCA--GA
    TCA--AA
    G--ATAC
    T--CTGC

The two subsets:

    TCC--AA
    TCA--AA

    TCA--GA
    G--ATAC
    T--CTGC

First subset after removing indel-only columns:

    TCCAA
    TCAAA

3.3. Re-align profiles

The two profiles are then realigned with each other using profile-profile alignment.

Profiles before re-alignment:

    TCCAA
    TCAAA

    TCA--GA
    G--ATAC
    T--CTGC

After re-alignment:

    T--CCAA
    T--CAAA
    TCA--GA
    G--ATAC
    T--CTGC
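Removing the indel-only columns from an extracted subset can be sketched in Python as follows, using the two subsets from the example above. The function name is illustrative and the alignment is represented simply as a list of equal-length strings.

```python
def drop_gap_only_columns(rows, gap="-"):
    """Remove every column in which all sequences have a gap."""
    keep = [index for index, column in enumerate(zip(*rows))
            if any(residue != gap for residue in column)]
    return ["".join(row[index] for index in keep) for row in rows]


if __name__ == "__main__":
    print(drop_gap_only_columns(["TCC--AA", "TCA--AA"]))
    print(drop_gap_only_columns(["TCA--GA", "G--ATAC", "T--CTGC"]))
```

The first subset loses its two gap columns, while the second subset is returned unchanged because each of its columns contains at least one residue.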

MUSCLE. Stage 3. Refinement

3.4. Accept/Reject

The score of the new alignment is computed. If the score is higher than that of the old alignment, the new alignment is retained; otherwise it is discarded.

    New          Old
    T--CCAA      TCC--AA
    T--CAAA      TCA--GA
    TCA--GA  OR  TCA--AA
    G--ATAC      G--ATAC
    T--CTGC      T--CTGC

Score of alignment (match = 1, mismatch = 0)

    1234
    ACGT
    ACGA
    AGGA

    Column 1: A-A + A-A + A-A = 1+1+1 = 3
    Column 2: C-C + C-G + C-G = 1+0+0 = 1
    Column 3: G-G + G-G + G-G = 1+1+1 = 3
    Column 4: T-A + T-A + A-A = 0+0+1 = 1

    S(alignment) = S(1) + S(2) + S(3) + S(4) = 3+1+3+1 = 8

The higher the score, the better the alignment.
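The column-by-column scoring in the example above can be reproduced with a short Python sketch. It uses the slide's simple scheme (match = 1, mismatch = 0) and makes no provision for gap penalties, which a real scoring function would need; the function names are illustrative.

```python
from itertools import combinations


def column_scores(rows, match=1, mismatch=0):
    """Score each column as the sum over all pairs of residues in it."""
    return [sum(match if a == b else mismatch
                for a, b in combinations(column, 2))
            for column in zip(*rows)]


def sum_of_pairs(rows, match=1, mismatch=0):
    """Alignment score: the sum of the column scores."""
    return sum(column_scores(rows, match, mismatch))


if __name__ == "__main__":
    alignment = ["ACGT", "ACGA", "AGGA"]
    print(column_scores(alignment), sum_of_pairs(alignment))
```

In the accept/reject step, a score of this kind would be computed for both the old and the new alignment, and the higher-scoring one kept.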

An incorrect conclusion may come from a sequence alignment using incorrect assumptions

Identification of TRAP orthologs as an example of the risk of common mistakes in the analysis.

In Bacillus subtilis the trp operon is also regulated by transcription attenuation. Instead of the ribosome, attenuation is mediated by an RNA-binding protein called TRAP (trp RNA-Binding Attenuation Protein). TRAP is formed of 11 identical subunits.

Angela Valbuzzi and Charles Yanofsky. Science, Vol. 293, 14 September 2001.

TRAP sequences (TRyptophan Attenuation Protein)

Suppose you want to align a set of MtrB sequences retrieved by gene name from NCBI.

Computational Genomics group. Computational Biology.

An incorrect conclusion may come from a sequence alignment using incorrect assumptions

Suppose you want to align a set of MtrB sequences retrieved by gene name from NCBI:

MtrB [Desulfobacterium autotrophicum HRM2]: signal transduction histidine kinase, nitrate/nitrite-specific
MtrB [Bacillus amyloliquefaciens FZB42]: tryptophan RNA-binding attenuator protein

Never forget that an MSA is just a model built on the set of sequences given by the user.

Exercise: Multiple sequence alignment

Use multiple sequence alignment to analyze how our model protein, anti-TRAP, aligns with its likely distant homologs.

Sequence search based on the anti-TRAP protein sequence.

Use multiple sequence alignment to analyze how anti-TRAP aligns with its likely distant homologs

>gi|16077322|ref|NP_388135.1| hypothetical protein BSU02530 [Bacillus subtilis subsp. subtilis str. 168]
MVIATDDLEVACPKCERAGEIEGTPCPACSGKGVILTAQGYTLLDFIQKHLNK
>gi|52078761|ref|YP_077552.1| inhibitor of TRAP, regulated by T-box (trp) sequence RtpA [Bacillus licheniformis ATCC 14580]
MVIATDDLETTCPNCNGSGREEPEPCPKCSGKGVILTAQGSTLLHFIKKHLNE
>gi|154684753|ref|YP_001419914.1| RtpA [Bacillus amyloliquefaciens FZB42]
MTGDGQTIKKGGIFMVIATDDLELTCPHCEGTGEEKEGTPCPKCGAKGVILTAQGNTLLHFIRKHIDQ
>gi|116749904|ref|YP_846591.1| hypothetical protein Sfum_2476 [Syntrophobacter fumaroxidans MPOB]
MVRMRLPELETKCWMCWGSGKIASEDHGGGMECPECGGVGWLPTADGRRLLDFVQRHLGIVEEGEDNETL
>gi|221194637|ref|ZP_03567694.1| chaperone protein DnaJ [Atopobium rimae ATCC 49626]
MASMNEKDYYVILEVSETATTEEIRKAFQVKARKLHPDVNKAPDAEARFKEVSEAYAVLSDEGKRRRYDA
MRSGNPFAGGYGPSGSPAGSNSYGQDPFGWGFPFGGVDFSSWRSQGSRRSRAYKPQTGADIEYDLTLTPM
QAQEGVRKGITYQRFSACEACHGSGSVHHSEASSTCPTCGGTGHIHVDLSGIFGFGTVEMECPECEGTGH
VVADPCEACGGSGRVLSASEAVVNVPPHAHDGDEIRMEGKGNAGTNGSKTGDFVVRVRVPEEQVTLRQSM
GARAIGIALPFFAVDLATGASLLGTIIVAMLVVFGVRNIVGDGIKRSQRWWRNLGYAVVNGALTGIAWAL
VAYMFFSCTAGLGRW
>gi|224372791|ref|YP_002607163.1| chaperone protein DnaJ [Nautilia profundicola AmH]
MDYYEILGVERTATKVEIKKAYRKLAMKYHPDKNPGDKEAEEMFKKINEAYQVLSDDEKRAIYDKYGKEG
LEGQGFKTDFDFGDIFDMFNDIFGGGFGGGRAEVQMPYDIDKAIEVTLEFEEAVYGVSKEIEINYFKLCP
KCKGSGAEEKETCPSCHGRGTIIMGNGFMRISQTCPQCSGRGFIAKKVCNECRGKGYIVESETVKVDIPA
GIDTGMRMRVKGRGNQDISGYRGDLYLIFNVKESKIFKRKGNNLIVEVPIFFTSAILGDTVKIPTLSGEK
EIEIKPHTKDNTKIVFRGEGIADPNTGYRGDLIAILKIVYPKKLTDEQRELLEKLHKSFGGEIKEHKSIL
EEAIDKVKSWFKGS
>gi|57867036|ref|YP_188723.1| dnaj protein [Staphylococcus epidermidis RP62A]
MAKRDYYEVLGVNKSASKDEIKKAYRKLSKKYHPDINKEEGADEKFKEISEAYEVLSDENKRVNYDQFGH
DGPQGGFGSQGFGGSDFGGFEDIFSSFFGGGSRQRDPNAPRKGDDLQYTMTITFEEAVFGTKKEISIKKD
VTCHTCNGDGAKPGTSKKTCSYCNGAGRVSVEQNTILGRVRTEQVCPKCEGSGQEFEEPCPTCKGKGTEN
KTVKLEVTVPEGVDNEQQVRLAGEGSPGVNGGPHGDLYVVFRVKPSNTFERDGDDIYYNLDISFSQAALG
DEIKIPTLKSNVVLTIPAGTQTGKQFRLKDKGVKNVHGYGYGDLFVNIKVVTPTKLNDRQKELLKEFAEI
NGENINEQSSNFKDRAKRFFKGE

UPGMA tree

UPGMA: Unweighted Pair Group Method with Arithmetic mean.
One of the fastest tree construction methods.
Used in Pileup (GCG package).
Clustal uses neighbor joining, but calculating an NJ tree is much more demanding; thus, UPGMA is demonstrated here.

Constructing MSA

    human      ACGTACGTCC
    gorilla    ACCACCGTCC
    chimp      ACCTACGTCC
    orangutan  ACCCCCCTCC

    human      ACGTACGTCC
    chimp      ACCTACGTCC

    gorilla    ACCACCGTCC
    orangutan  ACCCCCCTCC

    human      ACGTACGTCC
    chimp      ACCTACGTCC
    gorilla    ACCACCGTCC
    orangutan  ACCCCCCTCC
    macaque    CCCCCCCCCC

MUSCLE. Stage 2: improved similarity measure

Similarity is calculated for each pair of sequences using fractional identity computed from their mutual alignment in the current multiple alignment.

    TCC--AA
    TCA--GA
    TCA--AA
    G--ATAC
    T--CTGC

    TCC--AA
    TCA--AA