Ch. 9 Multiple Sequence Alignment (MSA)


- gathering seqs. to make an MSA
- doing an MSA with ClustalW
- doing an MSA with Tcoffee
- comparing seqs. that cannot be aligned

Introduction
- from pairwise alignment to MSA
- comparing DNA or protein seqs.
- more difficult for DNA because of mutation
- why MSA? to search for evolutionary (phylogenetic analysis) and structural similarity
- for protein seqs., regions that are similar in sequence usually superimpose in structure as well
- because it is easy to generate bad alignments that look good, we need to evaluate the quality of an alignment (use Tcoffee)
- seqs. that cannot be aligned can still be compared with the Gibbs sampler (identifies related segments of the same length) and Pratt (a motif discovery tool)

MSA is ideal for studying seqs. that share a common ancestor. MSA cannot be used if the seqs. have no similarity.

The MSA problem
- given N seqs., let L be the length of the alignment (the longest of the aligned seqs.)
- a minimal number of gaps is introduced into the seqs. so that the number of matches or similarities in each column is maximized
- let $A_i$ be the ith seq., $A_{i,k}$ the residue at the kth position of the ith seq., and $G(A_i)$ a gap penalty function
- the objective of MSA is to maximize the score function $\mathrm{Score}(A^N)$
- it is the sum of all possible pairwise alignment scores, i.e. there are N(N-1)/2 possible pairwise alignments

\[
\mathrm{Score}(A^N) = \sum_{k=1}^{L} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \mathrm{sub}(A_{i,k}, A_{j,k}) - \sum_{i=1}^{N} G(A_i)
\]

where $\mathrm{sub}(A_{i,k}, A_{j,k})$ is the score for pairing $A_{i,k}$ with $A_{j,k}$ in column k (i.e. insertion, deletion, substitution).
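As a rough illustration of the score function above, the following sketch computes a sum-of-pairs score for a small alignment. The match/mismatch values, the linear per-gap cost and the choice to score residue-gap pairs as 0 (so that gaps are charged only through G) are assumptions made for this example; a real program would use a PAM or BLOSUM matrix and affine gap penalties.

```python
from itertools import combinations

GAP = "-"

def sub(a, b, match=1, mismatch=-1):
    """Toy substitution score (assumed values, not a real matrix).

    Pairs that involve a gap score 0 here because gaps are charged
    separately through the gap penalty G.
    """
    if a == GAP or b == GAP:
        return 0
    return match if a == b else mismatch

def gap_penalty(row, per_gap=2):
    """Toy linear gap penalty G(A_i): a fixed cost per gap character."""
    return per_gap * row.count(GAP)

def sum_of_pairs_score(alignment):
    """Sum sub() over every column and every pair of rows, minus all G(A_i)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length L")
    pair_total = sum(
        sub(row_i[k], row_j[k])
        for k in range(length)
        for row_i, row_j in combinations(alignment, 2)  # N(N-1)/2 pairs
    )
    return pair_total - sum(gap_penalty(row) for row in alignment)

# Three toy DNA rows: columns contribute 3 + 3 + 1 - 1 = 6, one gap costs 2.
print(sum_of_pairs_score(["ACGT", "AC-T", "ACGA"]))  # prints 4
```

Because every pair of rows is visited in every column, the cost grows with N(N-1)/2, which is one reason exact optimization of this score is impractical for many seqs.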

distance (MAMPR, LOXAF) = 0.01852 + 0.00265 = 0.02117
distance (LOXAF, ELEMA) = 0.01852 + 0.00265 = 0.02117
distance (MAMPR, ELEMA) = 0.01852 + 0.01852 = 0.03704

Helping your research with MSA

Application: Procedure
- Extrapolation: an MSA can help convince you that an uncharacterized seq. is really a member of a protein family.
- Phylogenetic analysis: if you carefully choose the seqs. to include in the MSA, you can reconstruct the history of these proteins.
- Pattern identification: by discovering very conserved regions, one can identify the regions that characterize a function.
- Domain identification: turn the MSA into a profile that describes a protein family or domain. One can use this profile to scan the databases and look for new members of the family.
- DNA regulatory elements: aligning the promoters of a set of similarly regulated genes may reveal consensus binding sites for regulatory proteins. Turn a DNA MSA of a binding site into a weight matrix, then scan the DNA database for potential binding sites.
- Structure prediction: a good MSA can give a very good prediction of the protein/RNA secondary structure.

- Consider the human parvalbumin, P20472, a calcium-binding protein involved in muscle relaxation
- Use the ExPASy BLAST server to retrieve similar seqs., store them, and then build the MSA

- to search a protein/DNA database with the protein query, use blastp/tblastn on the whole database
- you can restrict the number of hits by selecting a smaller database, e.g. the microbial database

How to select the seqs. you want?
- for a first analysis, select about 10 seqs.; ideally the selected seqs. are evenly spaced between a very good E-value (say 10^-40) and a less good E-value (say 10^-5)
- select seqs. of about the same length and don't select fragments (MSA works best on seqs. of similar length)
- pick P20472, P80079, P02626, P02619, P43305, P32930, Q91482, P02620, P02622
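The selection rules above can be sketched in a few lines. This is only an illustration under stated assumptions: the hit list is a hypothetical list of `(accession, e_value, length)` tuples that you would fill in by hand from a BLAST report, `length_tolerance` is an invented parameter for "about the same length", and "evenly spaced" is interpreted on a log10 scale of the E-value.

```python
import math

def log_e(e_value):
    """log10 of an E-value; clamp because BLAST may report 0.0 for top hits."""
    return math.log10(max(e_value, 1e-300))

def select_hits(hits, query_length, n=10, length_tolerance=0.2):
    """Pick up to n accessions, evenly spread in log10(E-value).

    hits: list of (accession, e_value, length) tuples (hypothetical input).
    Hits whose length differs from the query by more than
    length_tolerance * query_length are treated as fragments and dropped.
    """
    similar = [h for h in hits
               if abs(h[2] - query_length) <= length_tolerance * query_length]
    similar.sort(key=lambda h: h[1])  # best (smallest) E-value first
    if len(similar) <= n:
        return [h[0] for h in similar]

    lo, hi = log_e(similar[0][1]), log_e(similar[-1][1])
    remaining, chosen = list(similar), []
    for step in range(n):
        # Evenly spaced target between the best and the worst E-value.
        target = lo + (hi - lo) * step / max(n - 1, 1)
        best = min(remaining, key=lambda h: abs(log_e(h[1]) - target))
        chosen.append(best)
        remaining.remove(best)
    chosen.sort(key=lambda h: h[1])
    return [h[0] for h in chosen]

# Made-up hits for illustration only (accessions H1..H7 and FRAG are invented).
example_hits = [
    ("H1", 1e-40, 109), ("H2", 1e-39, 108), ("H3", 1e-38, 110),
    ("H4", 1e-30, 109), ("H5", 1e-20, 108), ("H6", 1e-10, 109),
    ("H7", 1e-5, 107), ("FRAG", 1e-22, 40),
]
print(select_hits(example_hits, query_length=110, n=3))
```

With these made-up values the fragment is discarded even though its E-value sits near the middle of the range, which mirrors the advice to avoid fragment seqs.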

There are three ways to export these seqs.: FASTA, ClustalW, Tcoffee

EBI ClustalW server http://www.ebi.ac.uk/clustalw/index.html

Interpreting MSA result
- (*) indicates an entirely conserved column
- (:) indicates columns where all the residues have roughly the same size and the same hydropathy
- (.) indicates columns where either the size or the hydropathy is roughly the same
- a good block is a unit at least 10-30 amino acids long exhibiting 1~3 (*), 5~7 (:) and a few (.)

If you know the accession numbers of the seqs., you can retrieve them as shown in the following:

http://tw.expasy.org/sprot/sprot-retrieve-list.html
P20472, P80079, P02626, P02619, P43305, P32930, Q91482, P02620, P02622

See Appendix A for all the seqs. in FASTA format.
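To make the (*), (:) and (.) markers described under "Interpreting MSA result" more concrete, here is a minimal sketch that derives such a line from aligned rows. The "strong" and "weak" residue groups are the ones I recall from the Clustal documentation and should be treated as an assumption to check against the version you use; the input rows are toy strings, not real ClustalW output.

```python
# Residue groups as recalled from the Clustal documentation (assumption).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]

def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "                      # columns with gaps get no marker
    if len(residues) == 1:
        return "*"                      # entirely conserved column
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"                      # all residues in one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."                      # all residues in one weak group
    return " "

def conservation_line(alignment):
    """Build the marker line for a list of equal-length aligned rows."""
    return "".join(column_symbol(col) for col in zip(*alignment))

# Toy rows for illustration (treated as already aligned).
rows = ["SMTDLLNAED", "SMTDLLGAED", "SMTDVIPEAD"]
for row in rows:
    print(row)
print(conservation_line(rows))
```

Counting the markers inside a window of 10-30 columns is then a simple way to check the "good block" rule of thumb given above.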

Changing ClustalW parameters

Parameter: Effect
- Substitution matrix: substitution matrices control the cost of mutations in seq. alignments. If you select a matrix series, like PAM or BLOSUM, ClustalW automatically chooses the appropriate index within that series. If the seqs. are closely related, a change of matrix has no effect. If your alignment is difficult to interpret, it is worth changing from BLOSUM to PAM.
- Gap-opening penalty (GOP): the higher the value of GOP, the more difficult it is to open a gap. Tuning it has little effect because ClustalW readjusts GOP automatically.
- Gap-extension penalty (GEP): GEP controls the size of the gaps. It is impossible to predict the optimal combination of GOP/GEP. The only way to find this combination is empirical.

- the reason for changing the parameters is to test whether slight changes can improve the overall alignment

ClustalW
- the Clustal program was written by Higgins & Sharp, 1988
- ClustalW is a more recent revision; the W stands for the weights assigned to the seqs., which reflect the evolutionary changes in the aligned seqs. and the distribution of gaps between conserved domains
- ClustalX provides a graphical interface

Progressive algorithm
- fast, but errors made early in the process cannot be corrected (frozen-in errors); obvious errors can be edited manually with tools such as Jalview (Chapter 10), Seaview or Cinema
- start with the most similar seq. pair and continue to add seqs. in decreasing order of similarity
- one builds a clustering of the seqs. that looks like a phylogenetic tree (a dendrogram, stored in the file with the .dnd extension); for instance, one has two alignments AB and CD
- it then aligns the two alignments as if each of them were a single seq.; for instance, one can replace each alignment with a consensus seq.
- alignment of AB -> alignment of CD -> alignment of AB with CD (ABCD) -> alignment of (ABCD)E -> alignment of (ABCDE)F -> alignment of (ABCDEF)G

Making MSA with Tcoffee
- one of the most recently developed methods for MSA
- yields more accurate alignments at the cost of a longer running time
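Returning to the progressive strategy used by ClustalW: the order in which seqs. and sub-alignments are joined can be sketched with a simple greedy clustering. This is a simplification for illustration, using average linkage on a hypothetical table of pairwise similarities; it is not claimed to be the clustering method ClustalW itself applies, and the function names are invented for this example.

```python
from itertools import combinations

def average_similarity(cluster_a, cluster_b, similarity):
    """Mean pairwise similarity between the members of two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(similarity[frozenset(pair)] for pair in pairs) / len(pairs)

def progressive_order(names, similarity):
    """Greedy join order: always merge the two most similar clusters.

    similarity maps frozenset({name1, name2}) to a score (e.g. % identity).
    Returns the list of joins and a Newick-like string for the dendrogram.
    """
    clusters = {(name,): name for name in names}   # members -> label
    joins = []
    while len(clusters) > 1:
        left, right = max(
            combinations(clusters, 2),
            key=lambda pair: average_similarity(pair[0], pair[1], similarity),
        )
        joins.append((clusters[left], clusters[right]))
        merged_label = f"({clusters[left]},{clusters[right]})"
        del clusters[left], clusters[right]
        clusters[left + right] = merged_label
    (tree,) = clusters.values()
    return joins, tree + ";"

# Hypothetical pairwise similarities (percent identity) for five seqs.
scores = {"AB": 90, "CD": 85, "AC": 60, "AD": 60, "BC": 60, "BD": 60,
          "AE": 40, "BE": 40, "CE": 40, "DE": 40}
similarity = {frozenset(pair): value for pair, value in scores.items()}

joins, tree = progressive_order(list("ABCDE"), similarity)
for step, (x, y) in enumerate(joins, start=1):
    print(f"step {step}: align {x} with {y}")
print(tree)
```

With these made-up scores the first two joins are A with B and C with D, followed by the alignment of AB with CD and finally the addition of E, which is the same pattern as the AB, CD, (ABCD), (ABCD)E sequence of steps listed above.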

Tcoffee output files
- aln: a text file in the same format as a ClustalW alignment
- pdf

Pattern of Conservation in MSA

Amino acid: Characteristic
- W, Y, F: tryptophan is a large hydrophobic a.a. located in the core of proteins and important for stability. It is not easily mutated; when W does mutate, it is usually replaced by Y or F (the other aromatic a.a.). Patterns of conserved aromatic a.a. are a signature of domains.
- G, P: often associated with beta strands or alpha helices.
- C: a conserved column of C usually indicates a C-C disulphide bridge. Conserved columns of C separated by a characteristic distance are a signature of domains.
- H, S: probably a catalytic site, especially in proteases.
- D, E, R, K: charged a.a., often involved in ligand binding or in salt bridges (the association of two ionic protein groups of opposite charge).
- L: rarely very conserved unless involved in protein-protein interactions such as the leucine zipper.

Adding distantly related seqs.
- the alignment contains many conserved positions
- add a few distantly related seqs. one by one and check the effect of each on the overall alignment quality
- we want to make sure these distantly related seqs. enhance existing patterns rather than destroy them
- include those seqs. that BLAST reported as marginal hits when we scanned SWISS-PROT for homologues
- add P02591, TPCC_RABIT, the troponin C of rabbit, BLAST E-value = 3.1

- the new seq. respects the blocks that already existed, although some previously conserved positions are no longer conserved
- it also reveals regions where insertions and deletions are likely to occur, most likely loops
- add another distantly related seq. to check that these few highly conserved regions are indeed conserved across the whole protein family, even when we compare distantly related species
- include P19123, TPCC_MOUSE, mouse troponin C, BLAST E-value = 3.1

- the most conserved columns remain
- these two highly conserved regions are involved in some biological process
- we know that most of the proteins bind calcium
- one can safely bet that the calcium-binding site involves some of these conserved positions
- in fact, the SWISS-PROT annotation indicates that these regions are involved in calcium binding

Comparing sequences that you cannot align
- sometimes you may need to compare seqs. that don't necessarily have a common ancestor
- the Gibbs sampler looks for short, partially conserved, gap-free segments
- Pratt looks for flexible patterns that can contain gaps and only need to be conserved at certain positions

Gibbs Sampler
http://bioweb.pasteur.fr/seqanal/interfaces/gibbs-simple.html
- a stochastic method, so it is difficult to reproduce the same results
- but it can offer very sensible solutions
- very good at identifying HTH (Helix Turn Helix) domains across a protein family
- a nice way to search for regulatory elements shared by unrelated DNA seqs.
- to get good results you need >20 seqs.
- the Gibbs sampler is useful only when the segments you are looking for have exactly the same length, like HTH domains
- for motifs of different lengths use Pratt (http://www.ebi.ac.uk/pratt/index.html), TEIRESIAS (http://cbcsrv.watson.ibm.com/tspd.html), MEME (http://meme.sdc.edu/meme/website)
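The idea behind the Gibbs sampler, a stochastic search for gap-free segments of one fixed length, can be sketched as follows. This is a bare-bones site sampler written for illustration, not the program behind the server listed above: the function name, the pseudocount, the uniform background (left out because it only rescales the weights) and the toy DNA seqs. are all assumptions.

```python
import random
from collections import Counter

def gibbs_sampler(seqs, width, iterations=200, pseudocount=1.0,
                  seed=0, alphabet="ACGT"):
    """Return one start position per seq. for a shared motif of fixed width.

    Every character in seqs must belong to alphabet. The result depends on
    the random seed, which is why runs are hard to reproduce without it.
    """
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held_out, seq in enumerate(seqs):
            # Build a profile from the current segments of all other seqs.
            others = [s[p:p + width]
                      for i, (s, p) in enumerate(zip(seqs, positions))
                      if i != held_out]
            total = len(others) + pseudocount * len(alphabet)
            profile = []
            for col in range(width):
                counts = Counter(segment[col] for segment in others)
                profile.append({a: (counts[a] + pseudocount) / total
                                for a in alphabet})
            # Weight every possible start in the held-out seq. by the profile.
            weights = []
            for start in range(len(seq) - width + 1):
                weight = 1.0
                for col in range(width):
                    weight *= profile[col][seq[start + col]]
                weights.append(weight)
            # Sample (rather than maximize) the new start position.
            positions[held_out] = rng.choices(range(len(weights)),
                                              weights=weights)[0]
    return positions

# Toy DNA seqs. that share the segment TTGACA (made up for illustration).
toy = ["ACGTTTGACAGT", "TTGACACCGTAA", "GGCATTGACATC", "CATGTTGACAGG"]
starts = gibbs_sampler(toy, width=6)
for seq, start in zip(toy, starts):
    print(seq[start:start + 6])
```

Sampling the new position instead of always taking the best one is what lets the method escape poor starting points; it is also why different seeds can give different answers and why more seqs. give more stable results.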

Appendix A

>sp|P20472|PRVA_HUMAN Parvalbumin alpha - Homo sapiens (Human).
SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
>sp|P80079|PRVA_FELCA Parvalbumin alpha - Felis silvestris catus (Cat).
SMTDLLGAEDIKKAVEAFTAVDSFDYKKFFQMVGLKKKSPDDIKKVFHILDKDKSGFIEE
DELGFILKGFYPDARDLSVKETKMLMAAGDKDGDGKIDVDEFFSLVAKS
>sp|P02626|PRVA_AMPME Parvalbumin alpha - Amphiuma means (Salamander) (Two-toed amphiuma).
SMTDVIPEADINKAIHAFKAGEAFDFKKFVHLLGLNKRSPADVTKAFHILDKDRSGYIEE
EELQLILKGFSKEGRELTDKETKDLLIKGDKDGDGKIGVDEFTSLVAES
>sp|P02619|PRVB_ESOLU Parvalbumin beta - Esox lucius (Northern pike).
SFAGLKDADVAAALAACSAADSFKHKEFFAKVGLASKSLDDVKKAFYVIDQDKSGFIEED
ELKLFLQNFSPSARALTDAETKAFLADGDKDGDGMIGVDEFAAMIKA
>sp|P43305|PRVU_CHICK Parvalbumin, thymic CPV3 (Parvalbumin 3) - Gallus gallus (Chicken).
MSLTDILSPSDIAAALRDCQAPDSFSPKKFFQISGMSKKSSSQLKEIFRILDNDQSGFIE
EDELKYFLQRFECGARVLTASETKTFLAAADHDGDGKIGAEEFQEMVQS
>sp|P32930|ONCO_HUMAN Oncomodulin (OM) (Parvalbumin beta) - Homo sapiens (Human).
SITDVLSADDIAAALQECQDPDTFEPQKFFQTSGLSKMSANQVKDVFRFIDNDQSGYLDE
EELKFFLQKFESGARELTESETKSLMAAADNDGDGKIGAEEFQEMVHS
>sp|Q91482|PRV1_SALSA Parvalbumin beta 1 (Major allergen Sal s 1) - Salmo salar (Atlantic salmon).
MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ
>sp|P02620|PRVB_MERME Parvalbumin beta - Merluccius merluccius (European hake).
AFAGILADADITAALAACKAEGSFKHGEFFTKIGLKGKSAADIKKVFGIIDQDKSDFVEE
DELKLFLQNFSAGARALTDAETATFLKAGDSDGDGKIGVEEFAAMVKG
>sp|P02622|PRVB_GADCA Parvalbumin beta (Allergen Gad c 1) (Gad c I) (Allergen M) - Gadus callarias (Baltic cod).
AFKGILSNADIKAAEAACFKEGSFDEDGFYAKVGLDAFSADELKKLFKIADEDKEGFIEE
DELKLFLIAFAADLRALTDAETKAFLKAGDSDGDGKIGVDEFGALVDKWGAKG
>sp|P02591|TPCC_RABIT Troponin C, slow skeletal and cardiac muscles (TN-C) - Oryctolagus cuniculus (Rabbit).
MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKIM
LQATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
>sp|P19123|TPCC_MOUSE Troponin C, slow skeletal and cardiac muscles (TN-C) - Mus musculus (Mouse).
MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
LQATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
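FASTA records such as those in this appendix are easy to read back into a program before submitting them to an alignment tool. The sketch below is a minimal parser, assuming Swiss-Prot style headers of the form `sp|ACCESSION|ENTRY_NAME description` (it also tolerates spaces in place of the `|` separators); the function names are invented for this example.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)          # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def accession(header):
    """Extract the accession from an 'sp|ACC|NAME ...' style header."""
    fields = header.replace("|", " ").split()
    if len(fields) > 1 and fields[0] == "sp":
        return fields[1]
    return fields[0]

sample = """>sp|P20472|PRVA_HUMAN Parvalbumin alpha - Homo sapiens (Human).
SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
"""
for header, sequence in parse_fasta(sample):
    print(accession(header), len(sequence))
```

Comparing the sequence lengths printed this way is a quick check that no fragment slipped into the set before the MSA is built.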