Sequence Analysis '17 -- lecture 8. Multiple sequence alignment


Ex5 explanation
How many random database search scores have e-values ≤ 10? (Answer: 10!) Why?
The e-value of a score x is $E(x) = m \cdot P(S \ge x)$, where m is the database size and P() is the EVD (extreme value distribution), which models random database search scores. So, by definition, the e-value is the expected number of random database search scores at or above x.
[Figure: random-chance number of occurrences in a database search plotted against e-value, with points labeled m*P(S ≥ x) = 10, 4, 3, 2, 1 at e-values 10, 4, 3, 2, 1.]
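The relation between a score and its e-value can be sketched numerically. This is a minimal sketch assuming the EVD is a Gumbel distribution; the location `mu` and scale parameter `lam` are hypothetical inputs, not values from the course.

```python
import math

def evd_tail(x, mu, lam):
    """P(S >= x) for a Gumbel (extreme value) distribution.

    mu (location) and lam (scale parameter) are assumed, illustrative inputs.
    """
    return 1.0 - math.exp(-math.exp(-lam * (x - mu)))

def e_value(x, m, mu, lam):
    """Expected number of random hits scoring at least x in a database of size m."""
    return m * evd_tail(x, mu, lam)
```

Raising the score x lowers the e-value, and doubling the database size m doubles it.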

Manual editing of alignments in UGENE
Download and open bad alignment from the course web page.* Align using Kalign. Can you make it better? Edit manually to consolidate gaps without forcing too many mismatches. How many indel events are implied by your alignment?
* Opens as an alignment. Older versions of UGENE open this as a list of sequences instead of an alignment. If that happens, select the sequences, right-click and export all sequences as an alignment, then add it to the Project.

Methods for multiple sequence alignment
- Dynamic programming
- Star
- Progressive: ClustalW (uses variable gap penalty); Kalign (very fast, uses exact match)
- Progressive + stochastic: MUSCLE
MSA algorithms must be computationally efficient AND biologically relevant.

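Is dynamic programming possible for more than two sequences?
A 3-sequence alignment matrix: DP in 3D.
$A(i,j,k) = \max \{\, A(i-1,j-1,k-1) + S(i,j,k),\; A(i-1,j,k) - gap,\; A(i,j-1,k) - gap,\; A(i,j,k-1) - gap,\; A(i-1,j-1,k) - gap,\; A(i-1,j,k-1) - gap,\; A(i,j-1,k-1) - gap \,\}$
Here A is the alignment score matrix and S(i,j,k) is the score for matching position i of sequence 1, j of sequence 2 and k of sequence 3.
How about adding a 4th sequence? How does DP run time scale with the number of sequences?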
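The 3D recurrence can be tried on short sequences. This is a minimal sketch: the match term S(i,j,k) is taken to be a sum-of-pairs score with made-up match/mismatch values, and every move other than the three-way match costs a single `gap`, as in the recurrence. The function names and default values are illustrative assumptions.

```python
from itertools import product

def sp_score(a, b, c, match=1, mismatch=-1):
    """Sum-of-pairs score for one column (assumed form of S(i,j,k))."""
    pairs = [(a, b), (a, c), (b, c)]
    return sum(match if x == y else mismatch for x, y in pairs)

def align3_score(s1, s2, s3, gap=2):
    """Best score of a global alignment of three sequences by 3D dynamic programming."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    neg_inf = float("-inf")
    A = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    A[0][0][0] = 0
    # The 7 possible moves: every non-empty subset of the three sequences advances.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        if i == j == k == 0:
            continue
        best = neg_inf
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            if (di, dj, dk) == (1, 1, 1):
                step = sp_score(s1[i - 1], s2[j - 1], s3[k - 1])
            else:
                step = -gap
            best = max(best, A[pi][pj][pk] + step)
        A[i][j][k] = best
    return A[n1][n2][n3]
```

For three sequences of length L the table has (L+1)^3 cells with 7 predecessors each; for N sequences this becomes (L+1)^N cells with 2^N - 1 predecessors, which is why exact DP is not practical beyond a few sequences.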

Star alignment
1. Align all sequences to one sequence.
2. Stack them up.
Potential problems with star alignment: unaligned gaps, ambiguous associations.
[Figure: star diagram of sequences A to G.]
A G H . I . W W . P F W P
A G H . I I F W . P Y . .
A G H I I . . W F P F W P
A G H . I P W W . P . . .
Each pairwise alignment by itself looks fine, but when you stack them up, you see disagreements.

What that alignment should look like:
A G H I . W W P F W P
A G H I I F W P Y . .
A G H I I W F P F W P
A G H I P W W P . . .

BLAST "query-anchored" alignments are star alignments.

Progressive alignment
Method for progressive alignment:
1. Align all pairs. Save scores in a distance matrix.
2. Make a guide tree.
3. Pairwise align the two most similar sequences.
4. Align the next two most similar sequences, etc.
5. Add sequences until all sequences are aligned.
[Figure: distance matrix; guide tree; DP alignment matrix aligning the sequence to add (A W P Y, with a gap) against the current alignment (A G H I . W W P F / A G H I I F W P Y).]

Progressive alignment (same five steps).
[Figure: sequence 1 and sequence 2, each with positions 1, 2, 3.]

"Distances" versus "similarities"
Maximizing similarity and minimizing distance are equivalent if $d(i,j) + s(i,j) = s_{max}$ for each position in the alignment, where $s_{max}$ is the maximum possible similarity and the minimum distance is d = 0.
Distance based on identity score (p-distance): d = 100 - %identity.
Distance using the empirical J-C correction: $d_{JC} = -\ln\left(\frac{S_{real} - S_{rand}}{S_{ident} - S_{rand}}\right)$, where $S_{ident}$ = score of an identity alignment and $S_{rand}$ = mode score of a false alignment. For proteins, $S_{rand}$ is about 25%.
Twilight zone (R. Doolittle, 1986).
[Figure: $d_{JC}$ plotted against $S_{real}$.]

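Jukes-Cantor for proteins
Empirical J-C correction: $d_{JC} = -\ln\left(\frac{pid - 25}{75}\right)$, where pid is the percent identity and 25 = mode score of a false alignment.
[Figure: $d_{JC}$ and p-distance plotted against $S_{real}$; axis marks at 0, 0.25 and 0.75.]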
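The two distances can be compared directly. A minimal sketch of the empirical correction as written above; the function names are illustrative, and returning infinity at or below the random level is an assumed convention, not something stated in the lecture.

```python
import math

def p_distance(percent_identity):
    """Uncorrected distance: 100 - %identity."""
    return 100.0 - percent_identity

def jc_distance(percent_identity, s_rand=25.0, s_ident=100.0):
    """Empirical J-C corrected distance, -ln((pid - s_rand) / (s_ident - s_rand))."""
    if percent_identity <= s_rand:
        # At or below the random level the correction is undefined;
        # treating it as infinite is an assumption made for this sketch.
        return math.inf
    return -math.log((percent_identity - s_rand) / (s_ident - s_rand))
```

The corrected distance grows much faster than the p-distance as the percent identity approaches 25, which is the behaviour the twilight zone describes.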

Progressive alignment: making the guide tree from the distance matrix
1. Select the shortest distance i,j.
2. Join i,j.
3. Reduce the rank of the distance matrix by joining columns i and j, and rows i and j. Minimum rule: select the minimum of the values. Maximum rule: select the maximum of the values.
4. Repeat until rank = 1.
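The join-and-reduce loop can be sketched as follows. This is an illustrative implementation, not the course's code: distances are stored in a dictionary keyed by unordered pairs, and the `rule` argument switches between the minimum rule (often called single linkage) and the maximum rule (complete linkage).

```python
def guide_tree(names, dist, rule=min):
    """Build a nested-tuple guide tree by repeatedly joining the closest pair.

    dist maps frozenset({a, b}) to the distance between a and b.
    rule is min (minimum rule) or max (maximum rule).
    """
    clusters = list(names)
    d = dict(dist)
    while len(clusters) > 1:
        # Select the shortest distance i, j among the current clusters.
        i, j = min(
            ((a, b) for idx, a in enumerate(clusters) for b in clusters[idx + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        joined = (i, j)
        rest = [c for c in clusters if c not in (i, j)]
        # Reduce the matrix: the joined row/column takes the min or max of the two.
        for c in rest:
            d[frozenset((joined, c))] = rule(d[frozenset((i, c))], d[frozenset((j, c))])
        clusters = rest + [joined]
    return clusters[0]
```

Each pass removes two rows and columns and adds one, so the rank drops by one until a single cluster (the tree) remains.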

In class: progressive alignment
Making a guide tree. Neighbor-joining algorithm.
Pairwise matrix for sequences A, B, C, D, E, F (triangular; values in the order extracted): 97 81 77 82 59 32 80 55 31 90 65 40 61 42 33
Fill in J-C distances for A, B, C, D, E, F.
Draw guide tree here.

Progressive alignment (same five steps).
[Figure: DP alignment matrix aligning the sequence to add (A W P Y) against the current alignment (A G H I . W W P F / A G H I I F W P Y).]

How do we represent two aligned sequences as one "sequence"?

Position   1    2    3    4    5    6    7    8    9
Seq 1      A    G    H    I    .    W    W    P    F
Seq 2      A    G    H    I    I    F    W    P    Y

A          1    0    0    0    0    0    0    0    0
F          0    0    0    0    0    0.5  0    0    0.5
G          0    1    0    0    0    0    0    0    0
H          0    0    1    0    0    0    0    0    0
I          0    0    0    1    1    0    0    0    0
P          0    0    0    0    0    0    0    1    0
W          0    0    0    0    0    0.5  1    0    0
Y          0    0    0    0    0    0    0    0    0.5

All other rows (C, D, E, K, L, M, N, Q, R, S, T, V) are 0 at every position.

PSSMs and profiles
A 20xN scoring matrix: a set of probability distributions over the 20 amino acids. (Gap probabilities are usually not included.)
$P(a \mid i) = \frac{\sum_{S : S_i = a} w_S}{\sum_{S} w_S}$
Spoken equation: the probability of amino acid a at position i is the sum of the sequence weights $w_S$ over all sequences S such that the amino acid at position i of that sequence, $S_i$, is a, divided by the sum of the sequence weights $w_S$ over all sequences S.

Sequence weights???
w1 = 0.75, w2 = 0.25

Position      1    2    3    4    5    6     7    8    9
Seq 1 (w1)    A    G    H    I    .    W     W    P    F
Seq 2 (w2)    A    G    H    I    I    F     W    P    Y

A             1    0    0    0    0    0     0    0    0
F             0    0    0    0    0    0.25  0    0    0.75
G             0    1    0    0    0    0     0    0    0
H             0    0    1    0    0    0     0    0    0
I             0    0    0    1    1    0     0    0    0
P             0    0    0    0    0    0     0    1    0
W             0    0    0    0    0    0.75  1    0    0
Y             0    0    0    0    0    0     0    0    0.25

All other rows (C, D, E, K, L, M, N, Q, R, S, T, V) are 0 at every position.
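The weighted profile for the two-sequence example (w1 = 0.75, w2 = 0.25) can be reproduced with a few lines. This is a minimal sketch with illustrative names. One interpretation is assumed: position 5 shows I = 1 even though only one sequence has a residue there, so sequences with a gap at a position are left out of the denominator for that position.

```python
from collections import defaultdict

def profile(alignment, weights=None, gap_chars=".-"):
    """Return one {amino acid: probability} dict per alignment column.

    Sequences with a gap in a column are excluded from that column's
    normalisation (an assumption made for this sketch).
    """
    if weights is None:
        weights = [1.0] * len(alignment)
    columns = []
    for i in range(len(alignment[0])):
        counts = defaultdict(float)
        for seq, w in zip(alignment, weights):
            if seq[i] not in gap_chars:
                counts[seq[i]] += w
        total = sum(counts.values())
        columns.append({aa: c / total for aa, c in counts.items()} if total else {})
    return columns

msa = ["AGHI.WWPF", "AGHIIFWPY"]
weighted = profile(msa, [0.75, 0.25])
unweighted = profile(msa)
```

With equal weights, position 6 comes out as W = 0.5, F = 0.5; with the weights 0.75 and 0.25 it becomes W = 0.75, F = 0.25.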

Why do we need sequence weights?
An MSA represents a sequence "family". A sequence family has an amino acid preference at each position. That preference is determined by counting. But some groups of sequences may be over-represented in the MSA.
[Figure: tree with leaves labeled primates, rabbit, rat, E. coli, lawyer.]

Sequence weighting corrects for uneven sampling
Simplest weighting scheme: build a tree. Start with weight = 1.0 at the common ancestor of the tree. Split the weight evenly at each node.
Primate sequences are 10/18 of the tree, but only 0.125 of the weights, because they are overrepresented.
[Figure: tree with leaves labeled primates, rabbit, rat, lawyer, E. coli. Node weights shown: 1.000, 0.500, 0.125, 0.250, 0.250, 0.0625, 0.0625, 0.125, 0.125, 0.125. Leaf weights: 0.008 0.008 0.016 0.008 0.008 0.016 0.016 0.016 0.016 0.016 0.0625 0.031 0.031 0.031 0.031 0.0625 0.0625 0.500.]
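The even-split scheme is a short recursion over the tree. A minimal sketch, assuming the tree is written as nested tuples with strings as leaves; the toy tree below is made up for illustration and is not the tree from the slide.

```python
def leaf_weights(tree, weight=1.0):
    """Split the weight evenly at each node; return {leaf name: weight}.

    tree is either a leaf name (str) or a tuple of subtrees.
    """
    if isinstance(tree, str):
        return {tree: weight}
    weights = {}
    share = weight / len(tree)
    for child in tree:
        weights.update(leaf_weights(child, share))
    return weights

# Hypothetical toy tree: two closely related sequences share one branch.
toy_tree = ((("human", "chimp"), "rat"), "E. coli")
toy_weights = leaf_weights(toy_tree)
```

In the toy tree the lone outgroup gets 0.5 while the two closely related sequences get 0.125 each, so densely sampled clades are down-weighted.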

Progressive alignment: scoring a match against the current alignment
[Figure: DP alignment matrix aligning the sequence to add (A W P Y, with a gap) against the current alignment (A G H I . W W P F / A G H I I F W P Y).]
Example: matchscore = 0.25*S(P,W) + 0.75*S(P,F)
Match score for multiple sequence alignments:
$matchscore(i,j) = \sum_{n} \sum_{m} w_n \, w_m \, S(s^n_i, s^m_j)$
where n runs over the sequences in group 1, m runs over the sequences in group 2, $w_n$ = weight of sequence n, $w_m$ = weight of sequence m, $s^n_i$ = residue at position i of sequence n, $s^m_j$ = residue at position j of sequence m, and S(aa1, aa2) = substitution matrix value for aa1 to aa2.
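The double sum can be written out directly. A minimal sketch: the substitution values in `toy_subst` are hypothetical numbers chosen for the example, not entries from a real matrix, and skipping gap characters is an assumption, since the formula does not say how gaps are treated.

```python
def match_score(col1, col2, weights1, weights2, subst, gap_chars=".-"):
    """Weighted sum of substitution scores between two alignment columns.

    col1/col2 hold one residue per sequence in group 1 / group 2.
    Gap characters are skipped (an assumption made for this sketch).
    """
    total = 0.0
    for a, wn in zip(col1, weights1):
        for b, wm in zip(col2, weights2):
            if a in gap_chars or b in gap_chars:
                continue
            total += wn * wm * subst[frozenset((a, b))]
    return total

# Hypothetical substitution values, for illustration only.
toy_subst = {frozenset(("P", "W")): -4.0, frozenset(("P", "F")): -2.0}

# One new sequence (weight 1) with P, aligned to a column holding W and F.
score = match_score(["P"], ["W", "F"], [1.0], [0.25, 0.75], toy_subst)
```

With these toy values the result is 0.25*(-4) + 0.75*(-2) = -2.5, the same form as the worked example 0.25*S(P,W) + 0.75*S(P,F).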

NOTE: Initial pairwise alignments are used to get the distances that are used to make the guide tree, but these alignments are then discarded and new alignments are made using the progressive method.

CLUSTALW
JD Thompson, DG Higgins, TJ Gibson. Nucleic Acids Research, 1994.
- Start with an unrooted tree, built using neighbor joining.
- Choose a root to get the guide tree.
- Progressive alignment.
- Matches are scored using sequence weights.
- Gaps are position-dependent: the GOP (gap opening penalty) is lower for polar residues, and zero where there is already a gap.
http://www.ebi.ac.uk/tools/msa/clustalw2/
http://www.ch.embnet.org/software/clustalw-xxl.html

Lightning-striking-twice-in-the-same-place theory
There should be no gap penalty for aligning a gap to an already existing gap! If i is already a gap position in any sequence, set gap(i) = 0.
[Figure: DP alignment matrix aligning A W P Y against A G H I . W W P F / A G H I I F W P Y. No gap penalty for the purple arrow.]
$A(i,j) = A(i-1,j) - gap(1,i)$
$A(i,j) = A(i,j-1) - gap(2,j)$
Sequence-specific, position-specific gap penalties.
NOTE: DP is still optimal when the gap penalty is position-specific.
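The rule "gap(i) = 0 wherever a gap already exists" amounts to building one penalty per column before running the DP. A minimal sketch; the default penalty of 10.0 is an arbitrary illustrative value, not a ClustalW setting.

```python
def gap_penalties(alignment, gap_open=10.0, gap_chars=".-"):
    """Per-column gap penalty: zero where any sequence already has a gap."""
    length = len(alignment[0])
    return [
        0.0 if any(seq[i] in gap_chars for seq in alignment) else gap_open
        for i in range(length)
    ]

current = ["AGHI.WWPF", "AGHIIFWPY"]
penalties = gap_penalties(current)
```

For the example alignment only position 5 (index 4) already has a gap, so it is the only column with a zero penalty.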

CLUSTALW: position-specific gap penalty

MUSCLE: iterative MSA
RC Edgar. Nucleic Acids Research, 2004.
- k-mer distance matrix (not DP; based on short identical matches)
- UPGMA tree (one way to build a guide tree)
- progressive alignment --> MSA1
- UPGMA tree
- progressive alignment --> MSA2
For randomly selected tree branches:
1. Split the alignment into two groups.
2. Calculate profiles.
3. Align the profiles.
4. Accept or reject the new alignment.
5. Repeat.
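A distance "based on short identical matches" can be illustrated without any alignment. This is a simplified, hypothetical k-mer distance (one minus the fraction of shared k-mers), not the exact measure MUSCLE uses; the function names and the default k = 3 are assumptions.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a, b, k=3):
    """1 - (shared k-mers / k-mers in the sequence with fewer of them)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 1.0  # sequences shorter than k share nothing by convention here
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))
```

Because no DP is involved, filling the whole distance matrix this way is fast even for many sequences.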

UPGMA: Unweighted pair group method using averages

            A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----

Upper triangle: raw p-distances. Lower triangle: J-C corrected distances.
1) Generate neighbor-joining tree (NJ).
2) For first neighbors, distance to ancestor is dij/2.
3) For next neighbors, distance to ancestor is the average pairwise distance between taxa in the two clades, divided by two.
4) Subtract to get lineage distances.
[Figure: tree for A, B, C, D, E with branch lengths 0.145, 0.23, 0.115, 0.115, 0.085, 0.085.]
To be discussed again when we talk about trees...
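Steps 2 and 3 can be checked numerically on the J-C corrected distances from the table (taken here to be the lower triangle). This is a minimal sketch of plain UPGMA clustering; it does not build a neighbor-joining tree first, and the function and variable names are illustrative.

```python
from itertools import combinations

def upgma(names, dist):
    """Return a list of merges (clade1, clade2, height of the new ancestor).

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    clusters = [(n,) for n in names]
    merges = []

    def avg(c1, c2):
        # Average pairwise distance between the taxa of two clades.
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((c1, c2, avg(c1, c2) / 2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

jc = {
    ("A", "B"): 0.23, ("A", "C"): 0.87, ("A", "D"): 0.73, ("A", "E"): 0.59,
    ("B", "C"): 0.59, ("B", "D"): 1.12, ("B", "E"): 0.89,
    ("C", "D"): 0.17, ("C", "E"): 0.61, ("D", "E"): 0.31,
}
merges = upgma("ABCDE", {frozenset(pair): d for pair, d in jc.items()})
```

C and D join first at height 0.17/2 = 0.085, then A and B at 0.23/2 = 0.115, then E joins C and D at (0.61 + 0.31)/2/2 = 0.23. Subtracting gives 0.23 - 0.085 = 0.145 for the branch leading to the C, D ancestor.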

MUSCLE iterative alignment

Current MSA:
XP_001615335 YEPTDKEMDDILSAYFFYPSYKDYTRYVVDIFHRNYVSIFIYGNIAMPTEKEDENATS--
XP_002259219 YDPTDKEMDDLLSAYFFYPSYKDYTKYVVDFFHRNYVSIFIYGNIAMTTEKENENATS--
XP_001347897 YTPTNKEMYDILNAYFFYPSYNAYRTYVNEYFLRNYVVIFIYGNIIISDLKGEENITKNN
XP_726635 YIPTNKEIYDILNAYLFYPLYNSYIKYINNFFHKNYINIFIYGNLSIPNEINIKNETN--
XP_671449 ------------------------------------------------------------
XP_001458064 VVQAQYYTAELFLEELNILDLESLQQFHSNYFSNFRVSSFVSGNILRSEVEDLLHSIR--
XP_001347129 VVQAQYYTSQLFQDELATLDLESLQEFHSNYFSNFRVSSFVSGNILRSEVEDLLHTIR--
XP_002283970 DNTWPWMDG---LEVIPHLEADDLAKFVPMLLSRAFLECYIAGNIEPKEAEAMIHHIE--
XP_002367832 RNRFSQLDLRSAVTDASS-QFEDFKVFLEKVLTKNALDVFIMGDIDYEEARKLAEDFRAA

[Figure: phylogenetic tree with an X marking the random cut point.]

The two groups after the cut:
VVQAQYYTAELFLEELNILDLESLQQFHSNYFSNFRVSSFVSGNILRSEVEDLLHSIR--
VVQAQYYTSQLFQDELATLDLESLQEFHSNYFSNFRVSSFVSGNILRSEVEDLLHTIR--
DNTWPWMDG---LEVIPHLEADDLAKFVPMLLSRAFLECYIAGNIEPKEAEAMIHHIE--
RNRFSQLDLRSAVTDASS-QFEDFKVFLEKVLTKNALDVFIMGDIDYEEARKLAEDFRAA

YEPTDKEMDDILSAYFFYPSYKDYTRYVVDIFHRNYVSIFIYGNIAMPTEKEDENATS--
YDPTDKEMDDLLSAYFFYPSYKDYTKYVVDFFHRNYVSIFIYGNIAMTTEKENENATS--
YTPTNKEMYDILNAYFFYPSYNAYRTYVNEYFLRNYVVIFIYGNIIISDLKGEENITKNN
YIPTNKEIYDILNAYLFYPLYNSYIKYINNFFHKNYINIFIYGNLSIPNEINIKNETN--

DP profile-profile alignment gives the new MSA:
YEPTDKEMDDILSAYFFYPSYKDYTRYVVDIFHRNYV..SIFIYGNIAMPTEKEDENATS--
YDPTDKEMDDLLSAYFFYPSYKDYTKYVVDFFHRNYV..SIFIYGNIAMTTEKENENATS--
YTPTNKEMYDILNAYFFYPSYNAYRTYVNEYFLRNYV..FIYGNIIISDLKGEENITKNN
YIPTNKEIYDILNAYLFYPLYNSYIKYINNFFHKNYI..NIFIYGNLSIPNEINIKNETN--
VVQAQYYTAELFLEELNILDLESLQQFHS..NYFSNFRVSSFVSGNILRSEVEDLLHSIR--
VVQAQYYTSQLFQDELATLDLESLQEFHS..NYFSNFRVSSFVSGNILRSEVEDLLHTIR--
DNTWPWMDG---LEVIPHLEADDLAKFVP..MLLSRAFLECYIAGNIEPKEAEAMIHHIE--
RNRFSQLDLRSAVTDASS-QFEDFKVFLE..KVLTKNALDVFIMGDIDYEEARKLAEDFRAA

In each iteration, the phylogenetic tree is cut at a random branch, and the two subtrees are converted to profiles and aligned. The new alignment is either accepted or rejected.

Databases of multiple sequence alignments
- BAliBASE -- structural alignment-based
- BLOCKS -- gapless regions
- PFAM -- Hidden Markov models
- CDD -- conserved domain database
- FSSP -- structural alignment-based (families)

Visit BAliBASE
A database of curated multiple sequence alignments derived from structure-based alignments.
http://www.lbgi.fr/balibase/

Selective re-alignment
Global affine-gap DP alignment may be used to refine the region of an alignment between two conserved, confidently aligned columns.
Select the columns, then align the selected columns with MUSCLE. Or paste them into the ClustalW web site. Use the same penalty for opening a gap and for an end gap.

Exercise 7: make an MSA (due Oct 5)
1. Select a protein sequence in NCBI.
2. Run a BLAST search. Keep the top 50.
3. Select the hits and download to a FASTA file.
4. Open in UGENE (merge sequences into an alignment).
5. Run MUSCLE.
6. Color using Zappo.
7. Reduce the size so that the entire alignment (or as much of it as possible) fits on the screen. Save the image.
8. Paste into a file and write a blurb (10 words or less).
9. Save as PDF and send to me in an email.

Review
- Are multiple sequence alignments optimal?
- How is phylogenetic information used in MSA algorithms?
- What are the advantages/disadvantages of a star alignment?
- What information is ClustalW encoding in its MSA algorithm?
- What is the outermost loop in the MUSCLE alignment?