Overview Multiple Sequence Alignment


Inge Jonassen, Bioinformatics group, Dept. of Informatics, UoB (Inge.Jonassen@ii.uib.no)

Outline
  Definition/examples
  Use of alignments
  The alignment problem: scoring alignments, finding good alignments
  Alignment algorithms
  Local alignment methods and pattern discovery
  Conclusion

Definition
A global alignment of a set of sequences is obtained by inserting gap characters (-) into each sequence so that the resulting sequences all have the same length and no column contains only gap characters.

Example
Take the sequences INDUSTRY, INTERESTING and IMPORTANT. One alignment is:

  IN-DU-STRY-
  INTERESTING
  IM-POR-TANT

This is not an alignment, because the seventh column contains only gap characters:

  IN-DU--STRY-
  INTERE-STING
  IM-POR--TANT
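The definition above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `is_global_alignment` is hypothetical, and the valid example assumes that INTERESTING (which appears in the invalid example) is the third sequence of the set.

```python
def is_global_alignment(rows, sequences, gap="-"):
    """Return True if `rows` is a global alignment of `sequences`.

    Checks the two conditions of the definition (equal length, no all-gap
    column) and that removing the gaps gives back the input sequences.
    """
    if not rows or len(rows) != len(sequences):
        return False
    # Removing the gap characters must restore each original sequence.
    if any(row.replace(gap, "") != seq for row, seq in zip(rows, sequences)):
        return False
    # All gapped rows must have the same length.
    if len({len(row) for row in rows}) != 1:
        return False
    # No column may consist only of gap characters.
    return all(any(ch != gap for ch in column) for column in zip(*rows))


# Assumption: the sequence set is INDUSTRY, INTERESTING, IMPORTANT.
sequences = ["INDUSTRY", "INTERESTING", "IMPORTANT"]
valid = ["IN-DU-STRY-", "INTERESTING", "IM-POR-TANT"]
invalid = ["IN-DU--STRY-", "INTERE-STING", "IM-POR--TANT"]

print(is_global_alignment(valid, sequences))    # True
print(is_global_alignment(invalid, sequences))  # False: column 7 is all gaps
```

The check on gap removal is an extra safeguard that the definition leaves implicit: an alignment may only insert gaps, never change residues.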

Example: chromo domains aligned
[Figure: alignment of chromo domain sequences, annotated with "Conserved positions" and "Helix pattern".]

Use of alignments
Predict features of the aligned objects:
  conserved positions -> structurally/functionally important
  patterns of hydrophobicity/hydrophilicity -> secondary structure elements
  gappy regions -> loops/variable regions
  covariation -> structural proximity
[Figure annotation: three gappy regions of the alignment are marked "Loop?".]

Use of alignments - make patterns/profiles
An alignment can be turned into a profile or a pattern that is matched against a sequence database to identify new family members. Profiles/patterns can also be used to predict the family membership of new sequences.
Databases of profiles/patterns: PROSITE, PFAM, PRINTS, ...

Prosite: motifs for classification
A pattern (a regular expression) or a profile is derived from the alignment of each family. A protein sequence is then matched against Prosite pattern 1, Prosite pattern 2, ..., Prosite pattern n to assign it to family 1, family 2, ..., family n.
Pattern from alignment:

  [FYL]-x-[LIVMC]-[KR]-W-x-[GDNR]-[FYWLE]-x(5,6)-[ST]-W-[ES]-[PSTDN]-x(3)-[LIVMC]
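Because a Prosite pattern is a regular expression in a different notation, matching it against a sequence can be sketched in a few lines. This is an illustration only: `prosite_to_regex` is a hypothetical helper that covers just the syntax elements seen in the pattern above (plus `{...}` exclusions), and the test sequence is made up.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden}, optionally
# followed by a repeat count such as (3) or (5,6).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple Prosite-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element.strip())
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            regex = "."                       # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            regex = core                      # literal residue or [class]
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


PATTERN = ("[FYL]-x-[LIVMC]-[KR]-W-x-[GDNR]-[FYWLE]-x(5,6)-[ST]-W-"
           "[ES]-[PSTDN]-x(3)-[LIVMC]")
regex = prosite_to_regex(PATTERN)

# Hypothetical sequence constructed so that it contains one match.
sequence = "MMFALKWAGFAAAAASWEPAAALMM"
match = re.search(regex, sequence)
print(regex)
print(match.start() if match else None)
```

Scanning a sequence database then amounts to running `re.search` over every entry and collecting the hits.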

Alignment problem
Given a set of sequences, produce a multiple alignment that corresponds as well as possible to the biological relationships between the corresponding biomolecules.

For homologous proteins, two residues should be aligned (placed on top of each other):
  if they are homologous (evolved from the same residue in a common ancestor protein)
  if they are structurally equivalent

Automatic approach
  Need a way of scoring alignments: a fitness function that quantifies the goodness of an alignment.
  Need an algorithm for finding alignments with good scores.
  Not all methods provide a scoring function for the final alignment!

Analysis of fitness function
One can test whether the alignments that are optimal under a given fitness function correspond well to the biological relationships between the sequences, for example when the structures of (some of) the proteins are known.

Alignment scores
The score of an alignment of two sequences can be defined using:
  a scoring matrix (e.g., PAM, BLOSUM)
  a gap penalty (linear, affine)

Alignment scores: SP - sum-of-pairs
A multiple alignment implies a pairwise alignment for each pair of sequences. SP defines the score of the multiple alignment as the sum of the scores of all implied pairwise alignments.

SP - example
For the example alignment containing the rows IN-DU-STRY- and IM-POR-TANT, the three implied pairwise alignments score 15, 13 and 23, so the SP score is 15 + 13 + 23 = 51.

SP - definition
If $A_{i,j}$ is the score of the alignment implied for sequence pair $(i,j)$, then the total score is:

  $SP = \sum_{i<j} A_{i,j}$

WSP - definition
It is often useful to weight the sequence pairs:

  $WSP = \sum_{i<j} w_{i,j} A_{i,j}$

There may be strong biases in the sequence set, e.g. a large number of nearly identical sequences. Pairs that include one of these can be given low weights to reduce their impact on the score.

Tree alignment
It is assumed that an evolutionary tree for the sequences is known. The sequences are the leaves of the tree.
Problem: assign sequences to the interior nodes so that the score summed over all edges is maximal. Once the interior nodes carry sequences, a score can be calculated for every edge in the tree.
The sequence assignment giving the best score defines the best alignment according to this measure and for the given tree.

Tree alignment - example
[Figure: a tree with the leaves INDUSTRY, IMPORTANT and INDUSTRIAL; the sequences at the interior nodes are unknown and shown as question marks.]
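The SP and WSP definitions translate directly into code. The sketch below is illustrative only: it assumes a simple match/mismatch/gap scheme instead of a PAM or BLOSUM matrix, so it does not reproduce the scores 15, 13 and 23 from the slide, and the toy alignment and weights are made up.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=2, mismatch=-1, gap=-2):
    """Score the pairwise alignment implied by two rows of a multiple alignment.

    Assumption: a flat match/mismatch score and a linear gap penalty.
    """
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # column does not exist in the implied pairwise alignment
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def sp_score(rows, weights=None):
    """Sum-of-pairs score; with `weights` (keyed by (i, j), i < j) it is WSP."""
    total = 0
    for i, j in combinations(range(len(rows)), 2):
        w = 1 if weights is None else weights[(i, j)]
        total += w * pair_score(rows[i], rows[j])
    return total


# Toy alignment (hypothetical, not from the slides).
rows = ["AC-T", "ACGT", "A-GT"]
weights = {(0, 1): 0.5, (0, 2): 1, (1, 2): 1}  # down-weight one pair

print(sp_score(rows))           # 4 + 0 + 4 = 8
print(sp_score(rows, weights))  # 0.5*4 + 0 + 4 = 6.0
```

Columns in which both rows have a gap are skipped because they vanish when the pair is extracted from the multiple alignment.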

Alignment algorithms
Given a set of n sequences of average length l, find a good alignment!
For n = 2, we have seen that dynamic programming can be used; the time taken is proportional to $l^2 = l^n$.

Dynamic programming for n sequences
[Figure: dynamic programming table with Sequence 1 and Sequence 2 along its axes.]
Assume we have n sequences of length l. The table will have $l^n$ entries. For example, 10 sequences of length 100 give a table with $100^{10} = 10^{20}$ entries,
  which would take at least 100 million terabytes of memory (one byte per entry)
  which would take about 3 million years to fill in if 1 million entries can be computed per second
Not feasible for n > 4 or 5.

Progressive alignment
Observations:
  Two sequences at a time can be aligned using dynamic programming.
  The output of each pairwise alignment is an alignment.
  Pairs of alignment/alignment or alignment/sequence can also be aligned using dynamic programming.
Strategy:
  Align the most similar sequences first.
  Progressively align more distant sequences until all sequences have been aligned.
  Use a rooted tree with the sequences at the leaves to decide the order of the alignments.

The Clustal algorithm
Three steps:
  1. Compare all pairs of sequences to obtain a similarity matrix (pairwise comparison).
  2. Based on the similarity matrix, make a guide tree relating all the sequences (clustering/making tree).
  3. Perform progressive alignment, where the order of the alignments is determined by the guide tree (align according to tree).
[Figure: panel (A) illustrates steps 1 and 2, panel (B) illustrates step 3.]
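Step 2 and the "most similar first" strategy can be sketched as follows. This is a simplified illustration, not the Clustal implementation: it assumes greedy average-linkage clustering on a precomputed similarity matrix, the function names are hypothetical, and the matrix values are made up. The actual alignment of the merged groups (step 3) is left out.

```python
from itertools import combinations


def average_similarity(group_a, group_b, sim):
    """Mean similarity over all sequence pairs drawn from the two groups."""
    total = sum(sim[i][j] for i in group_a for j in group_b)
    return total / (len(group_a) * len(group_b))


def guide_tree_order(sim):
    """Return the merge order implied by greedy average-linkage clustering.

    `sim` is a symmetric similarity matrix. Each merge is a pair of groups
    (tuples of sequence indices) that would be aligned with each other next.
    """
    groups = [(i,) for i in range(len(sim))]
    merges = []
    while len(groups) > 1:
        # Pick the two groups that are most similar on average.
        a, b = max(combinations(groups, 2),
                   key=lambda pair: average_similarity(pair[0], pair[1], sim))
        groups.remove(a)
        groups.remove(b)
        groups.append(a + b)
        merges.append((a, b))
    return merges


# Hypothetical similarity matrix for four sequences.
sim = [
    [1.0, 0.9, 0.3, 0.2],
    [0.9, 1.0, 0.4, 0.3],
    [0.3, 0.4, 1.0, 0.8],
    [0.2, 0.3, 0.8, 1.0],
]
for a, b in guide_tree_order(sim):
    print("align", a, "with", b)
```

For this matrix the order is: sequences 0 and 1, then sequences 2 and 3, then the two resulting alignments with each other.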

ClustalW - Score of aligning two alignment columns
  Sum the score matrix entries for all pairs of residues.
  Weight each pair by the sequence weights.

Example: a column with t (sequence 1) and l (sequence 2) is aligned with a column with v (sequence 3) and i (sequence 4).

  1: peeksavtal
  2: geekaavlal
  3: egewglvlhv
  4: aaektkirsa

  Score: M(t,v) + M(t,i) + M(l,v) + M(l,i)
  Weighted score: w1*w3*M(t,v) + w1*w4*M(t,i) + w2*w3*M(l,v) + w2*w4*M(l,i)

ClustalW - Weighting sequences
  Each sequence is given a weight.
  Groups of related sequences receive lower weight.

ClustalW - Similarity matrix
The distance between sequences, measured from the guide tree, determines which matrix to use:

  Sequence identity   Matrix
  80-100%             Blosum80
  60-80%              Blosum62
  30-60%              Blosum45
  0-30%               Blosum30

ClustalW - Gap penalties
  Initial gap penalty: GOP
  Gap extension penalty: GEP

  GTEAKLIVLMANE
  GA---------KL

  Penalty: GOP + 8*GEP (a gap of 9 positions: GOP for opening it, GEP for each of the 8 further positions)

ClustalW - Modifications of gap penalty
Position-specific penalty:
  Is there a gap at the position?
    yes -> lower GOP
    no, but there is a gap within 8 residues -> increase GOP
  hydrophilic residues -> lower GOP

Globin alignment
[Figure: globin alignment computed with the default gap penalty GEP=0.05.]
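The weighted column score and the gap penalty from the slides above can be written out as a small sketch. It is an illustration rather than ClustalW code: the function names are hypothetical, and the matrix entries, sequence weights and GOP value are made-up numbers chosen only to show the arithmetic.

```python
def lookup(matrix, a, b):
    """Read a symmetric score matrix stored as a dict keyed by residue pairs."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def column_score(col_a, col_b, weights, matrix):
    """Weighted score of aligning two alignment columns.

    `col_a` and `col_b` map sequence id -> residue in that column;
    `weights` maps sequence id -> sequence weight.
    """
    total = 0.0
    for id_a, res_a in col_a.items():
        for id_b, res_b in col_b.items():
            total += weights[id_a] * weights[id_b] * lookup(matrix, res_a, res_b)
    return total


def gap_penalty(length, gop, gep):
    """Penalty for one gap: GOP for opening it, GEP for each further position."""
    return gop + (length - 1) * gep


# Illustrative matrix entries (assumed values, not taken from the slides).
M = {("t", "v"): 0, ("t", "i"): -1, ("l", "v"): 1, ("l", "i"): 2}
col_12 = {1: "t", 2: "l"}   # column from the alignment of sequences 1 and 2
col_34 = {3: "v", 4: "i"}   # column from the alignment of sequences 3 and 4

unweighted = column_score(col_12, col_34, {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}, M)
weighted = column_score(col_12, col_34, {1: 1.0, 2: 0.5, 3: 1.0, 4: 0.5}, M)
print(unweighted)  # M(t,v) + M(t,i) + M(l,v) + M(l,i) = 2.0
print(weighted)    # 0 - 0.5 + 0.5 + 0.5 = 0.5

# The 9-position gap from the slide: GOP + 8*GEP (GOP=10.0 is hypothetical).
print(gap_penalty(9, gop=10.0, gep=0.05))
```

Setting all weights to 1 recovers the unweighted sum, which makes it easy to see how down-weighting related sequences changes the column score.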

Globin alignment - with insert
[Figures: globin alignment with an insert, computed once with the default gap penalty GEP=0.05 and once with a lowered gap penalty GEP=0.01.]

ClustalW - summary
  Does not use a score for the final alignment.
  Each pairwise alignment is done using dynamic programming.
  Heuristics (e.g., gap-penalty modifications) are used; they are tailored to globular proteins.
  Graphical version: ClustalX

SAGA: Sequence Alignment by Genetic Algorithm
  An objective function is used to score the alignments.
  An alignment is represented as a bit string.
  A population of alignments is evolved.
  Alignments can be combined (cross-over).
  Alignments can be mutated.
  Alignments with higher scores are more likely to be chosen for mating/survival.

Local multiple alignment
  Take one (or zero, or several) segment(s) (fragments) from each sequence and align them.
  Maximise the similarity of the aligned fragments.
  Most methods do not allow for gaps in the local alignment.
  Example method: MEME

MEME - Multiple EM for Motif Elicitation
  EM = Expectation Maximisation
  Statistical method
  Builds a model of the local alignment
  Iteratively refines the model and realigns the sequences to the model

Example MEME output

  Possible examples of motif 1 in the training set

  Sequence name  Start  Score  Site
  -------------  -----  -----  ---------
  2BHD_STREX        81  28.80  VAYAREEFGS VDGLVNNAG ISTGMFLETE
  3BHD_COMTE        81  25.99  MAAVQRRLGT LNVLVNNAG ILLPGDMETG
  ADH_DROME         86  22.33  LKTIFAQLKT VDVLINGAG ILDDHQIERT
  AP27_MOUSE        77  24.36  TEKALGGIGP VDLLVNNAA LVIMQPFLEV
  BA72_EUBSP        86  26.39  VGQVAQKYGR LDVMINNAG ITSNNVFSRV
  BDH_HUMAN        138  23.46  PFEPEGPEKG MWGLVNNAG ISTFGEVEFT
  BPHB_PSEPS        79  18.60  ASRCVARFGK IDTLIPNAG IWDYSTALVD
  BUDC_KLETE        80  20.97  VEQARKALGG FNVIVNNAG IAPSTPIESI
  DHES_HUMAN        84  25.67  AARERVTEGR VDVLVCNAG LGLLGPLEAL
  DHGB_BACME        87  26.39  VQSAIKEFGK LDVMINNAG MENPVSSHEM
  DHMA_FLAS1       198  16.36  ILVNMIAPGP VDVTGNNTG YSEPRLAEQV
  ENTA_ECOLI        73  21.90  CQRLLAETER LDALVNAAG ILRMGATDQL
  FIXR_BRAJA       112  23.67  EVKKRLAGAP LHALVNNAG VSPKTPTGDR
  GUTD_ECOLI        82  17.17  SRGVDEIFGR VDLLVYSAG IAKAAFISDF
  HDE_CANTR         92  20.90  VETAVKNFGT VHVIINNAG ILRDASMKKM
  ...

Motif discovery
[Diagram labels: unaligned sequences/structures; aligner; analyse alignment; unaligned sequences; pattern discovery method; motif.]

Pratt - functionality
[Diagram labels: unaligned sequences; user parameters; alignment or query sequence; Pratt; patterns matching at least min nr. of input sequences.]

Pratt - example
Input: 286 zinc finger containing sequences. Parameters: CM=285, px=15.
Pratt finds the following pattern, matching 285 sequences:

  C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

Evaluation of alignment methods
  Align a set of protein sequences for which the structures are known (at least for some proteins).
  Align the protein structures.
  Identify motifs from the structure alignment.
  Check whether the sequence alignment has correctly aligned the motifs.
  McClure et al, 1994; Thompson et al, 1999
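Returning to the MEME output listed above: the "Site" column is a gapless local alignment, and the kind of model MEME refines can be illustrated by tallying residue counts per column. The sketch below does only that tally. It is not the EM procedure itself, and the function names are hypothetical. The sites are copied from the example output.

```python
from collections import Counter

# Motif sites from the example MEME output (central column of "Site").
SITES = [
    "VDGLVNNAG", "LNVLVNNAG", "VDVLINGAG", "VDLLVNNAA", "LDVMINNAG",
    "MWGLVNNAG", "IDTLIPNAG", "FNVIVNNAG", "VDVLVCNAG", "LDVMINNAG",
    "VDVTGNNTG", "LDALVNAAG", "LHALVNNAG", "VDLLVYSAG", "VHVIINNAG",
]


def position_counts(sites):
    """Count the residues observed in each column of the aligned sites."""
    return [Counter(column) for column in zip(*sites)]


def consensus(sites):
    """Most frequent residue per column."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(sites))


counts = position_counts(SITES)
print(consensus(SITES))
print(counts[7])  # column 8: A in 14 of the 15 sites
```

Dividing each count by the number of sites gives position-specific frequencies, which is the form in which such a model is usually scored against new sequences.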

Alignments are important
Basis for other analyses:
  structure prediction
  phylogeny
  experiments: PCR primer identification, site-directed mutagenesis, ...
  identification of motifs

Open problems - space for improvements!
  Good scoring function for alignments
  Identify well-aligned regions
  Efficient algorithms
  Resolving repeat structure, domain movements etc.
  Incorporating external information

Future development
More sequences:
  more families, but not so many
  more densely populated families
  easier alignment problem
  identify more ancient relationships (superfamilies)
More structures:
  more sequences can be threaded
  alignments help