Multiple Sequence Alignment: HMMs and Other Approaches

Background readings:
- Durbin et al., Section 3.1
- Ewens and Grant, Ch. 4
- Wing-Kin Sung, Ch. 6
- Beerenwinkel N, Siebourg J. Statistics, probability, and computational science. In M. Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods. Springer, to appear. Section 6.

Prepared by Zohar Yakhini (Technion) using material from slides by Colin Dewey (U of Wisconsin).

Multiple Sequence Alignment: Task Definition
Given:
- a set of more than 2 sequences
- a method for scoring an alignment
Do: determine the correspondences between the sequences such that the alignment score is maximized.

Multiple Sequence Alignment

Related Task: Profile Alignment Given a query sequence, s a database of profiles (based on multiple alignment of related sequences) P i, i =1 k Do: compute an alignment score Φ(s, P i ) for each one of the profiles determine the best matching profile for the query sequence 4

Scoring an alignment
How do we assess the quality of a given alignment? We will work with column scores: $S = \sum_j S_j$.
- Sum of pairs: $S_j = \sum_{k<l} SM(m_j^k, m_j^l)$, where $m_j^k$ is the character in the j-th column for the k-th aligned sequence and SM is a substitution scoring matrix operating on the relevant characters.
- Minimum entropy: $S_j$ is the frequency-based entropy of the j-th column. Lower entropy means a more conserved column, so this score is minimized rather than maximized.
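The two column scores can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by any particular tool: the function names and the `toy_score` substitution function are hypothetical, and treating the gap character as an ordinary symbol in the entropy is an assumption made here for simplicity.

```python
import math
from collections import Counter
from itertools import combinations

def sum_of_pairs_column(column, score):
    """S_j for sum of pairs: add score(a, b) over all unordered row pairs."""
    return sum(score(a, b) for a, b in combinations(column, 2))

def entropy_column(column):
    """S_j for minimum entropy: Shannon entropy (bits) of the column frequencies.

    Assumption: the gap character '-' is counted like any other symbol.
    """
    total = len(column)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(column).values())

def alignment_score(rows, column_score):
    """S = sum over columns j of S_j."""
    return sum(column_score(col) for col in zip(*rows))

def toy_score(a, b):
    """Hypothetical stand-in for SM: +1 match, -1 mismatch, -2 gap vs. residue."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

rows = ["CAFTPA", "CKTTPA", "CA-TPD", "CAF--D"]
sp_total = alignment_score(rows, lambda col: sum_of_pairs_column(col, toy_score))
entropy_total = alignment_score(rows, entropy_column)
```

A fully conserved column such as `CCCC` has entropy 0, while a column split evenly between two characters has entropy 1 bit.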

Dynamic Programming

Time complexity: exponential in the number of sequences. For k sequences of length n:
- When using sum of pairs: $O(k^2 2^k n^k)$
- When using entropy: $O(k 2^k n^k)$
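To get a feel for how quickly these bounds grow, the following sketch multiplies out the three factors: roughly n^k lattice cells, 2^k - 1 predecessor cells per cell, and the cost of scoring one column. The function name `dp_work_estimate` is hypothetical, and the result is an order-of-magnitude operation count under these assumptions, not a measured running time.

```python
def dp_work_estimate(n, k, scoring="sum_of_pairs"):
    """Rough operation count for exact DP over k sequences of length n."""
    cells = n ** k                 # lattice cells (boundary cells ignored)
    predecessors = 2 ** k - 1      # each sequence advances or takes a gap, not all gaps
    if scoring == "sum_of_pairs":
        per_column = k * (k - 1) // 2   # number of row pairs, O(k^2)
    elif scoring == "entropy":
        per_column = k                  # one pass over the column, O(k)
    else:
        raise ValueError("scoring must be 'sum_of_pairs' or 'entropy'")
    return cells * predecessors * per_column

# Sequences of length 100: each additional sequence multiplies the work by more than 100.
for k in (2, 3, 4, 5):
    print(k, dp_work_estimate(100, k), dp_work_estimate(100, k, "entropy"))
```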

Heuristic approaches
Since the time complexity of the DP approach is exponential in the number of sequences, heuristic methods are usually used.
- Progressive alignment: construct a succession of pairwise alignments
  - Star approach
  - Tree approaches (like CLUSTALW; Thompson et al. 1994)
- Iterative refinement: given a multiple alignment (say, from a progressive method), remove a sequence and realign it to the profile of the other sequences; repeat until convergence

Star-Shaped Alignment
Given: k sequences to be aligned, $x_1, \dots, x_k$
- Pick one sequence $x_c$ as the center.
- For each $x_i \neq x_c$, determine an optimal pairwise alignment with the center.
- Merge the pairwise alignments.
- Return the multiple alignment resulting from the aggregation.
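The steps above can be sketched as follows. This is one possible reading of the procedure, not the slides' own implementation: the pairwise aligner is a plain Needleman-Wunsch with a linear gap penalty, the scoring parameters are arbitrary, and the merge follows the usual "once a gap, always a gap" convention. All function names are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns (aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    sub = lambda i, j: match if a[i - 1] == b[j - 1] else mismatch
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + sub(i, j),
                          s[i - 1][j] + gap, s[i][j - 1] + gap)
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback from the bottom-right corner
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def merge(msa, center_aln, other_aln):
    """Add one pairwise alignment to msa, whose first row is the gapped center."""
    rows, new_row, i, j = [[] for _ in msa], [], 0, 0
    width = len(msa[0])
    while i < width or j < len(center_aln):
        if i < width and j < len(center_aln) and msa[0][i] == center_aln[j]:
            for r, row in zip(rows, msa):      # same center symbol: shared column
                r.append(row[i])
            new_row.append(other_aln[j]); i += 1; j += 1
        elif i < width and msa[0][i] == "-":   # gap already in the MSA center row
            for r, row in zip(rows, msa):
                r.append(row[i])
            new_row.append("-"); i += 1
        elif j < len(center_aln) and center_aln[j] == "-":  # new gap in the center
            for r in rows:
                r.append("-")
            new_row.append(other_aln[j]); j += 1
        else:
            raise ValueError("center rows disagree")
    return ["".join(r) for r in rows] + ["".join(new_row)]

def star_align(seqs, center):
    """Align every sequence to seqs[center] and merge; rows keep the input order."""
    msa, order = [seqs[center]], [center]
    for idx, s in enumerate(seqs):
        if idx != center:
            c_aln, s_aln = needleman_wunsch(seqs[center], s)
            msa = merge(msa, c_aln, s_aln)
            order.append(idx)
    result = [None] * len(seqs)
    for row, idx in zip(msa, order):
        result[idx] = row
    return result

seqs = ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"]
msa = star_align(seqs, center=0)
print("\n".join(msa))
```

The merge never removes a gap that an earlier pairwise alignment placed in the center row, which is what keeps every pairwise alignment with the center intact inside the final multiple alignment.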

Star-Shaped Alignment: Example

Star-Shaped Alignment: Example. The merging stage

Star-Shaped Alignment: Example. The merging stage (cont.)

Star-Shaped Alignment: Picking the center
- Try all sequences as centers and then return the best resulting alignment, or
- Select as center the sequence $x_c$ that maximizes $\sum_{x_i \neq x_c} \Phi(x_i, x_c)$.
The SP distance score of the alignment resulting from either approach is at most twice the SP distance score of the optimal alignment (Gusfield 1993, Bafna et al. 1997).
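The second option can be sketched as below, assuming Φ is a global pairwise alignment score with a linear gap penalty. The names `global_score` and `pick_center` and the scoring parameters are illustrative choices, not taken from the slides.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score only, keeping two DP rows in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def pick_center(seqs):
    """Index c maximizing the sum of Phi(x_i, x_c) over all i != c."""
    def total(c):
        return sum(global_score(seqs[c], s)
                   for i, s in enumerate(seqs) if i != c)
    return max(range(len(seqs)), key=total)

seqs = ["ACGA", "ACGT", "ACGG", "TTTT"]
center = pick_center(seqs)   # index of the sequence closest to all the others
```

This costs one pairwise alignment per pair of sequences, which is far cheaper than building a full multiple alignment for every candidate center.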

Aligning a query sequence to a profile
- Use existing knowledge about a family of proteins to produce an HMM for the family.
- Determine the fitness of any given query to any family (using the Viterbi and/or the Forward algorithm).
- Determine the best-fitting family for a given query among several possibilities.
- Possibly add the query to the family using an alignment determined by a Viterbi path.

HMM Profile of an alignment

HMM Profiles
The HMM profile graph shown on the preceding slide represents the transition matrix and the emission distributions of an HMM that describes a protein family.
- The model has a length; it is 3 in that example.
- Parameters are inferred from a given alignment, for example by frequency counting.

Example
Consider the alignment:
CAFTPA
CKTTPA
CA-TPD
CAF--D
Then for a model of length 6 we have (M denotes transition probabilities, E emission probabilities):
- M(start,i0) = M(start,d1) = ε and M(start,m1) = 1 - 2ε
- E(m1,C) is close to 1, and all other amino acids get ε/19, say
- M most likely takes m1 to m2
- E(m2,A) ~ 0.75; E(m2,K) ~ 0.25; other amino acids get ε fractions
- M(m2,m3) ~ 0.75; M(m2,d3) ~ 0.25
- E(m3,F) ~ 0.66; E(m3,T) ~ 0.33
- Etc.
- M(*6,end) = 1, of course
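The frequency counts in this example can be reproduced with a short sketch. It assumes every alignment column is a match column (model length 6, as above) and omits the ε pseudocounts, so unseen residues and transitions simply get probability 0 here. The function names are hypothetical.

```python
from collections import Counter

def match_emissions(rows):
    """Emission frequencies E(m_j, a) per column, counted over non-gap residues."""
    emissions = []
    for col in zip(*rows):
        residues = [c for c in col if c != "-"]
        counts = Counter(residues)
        emissions.append({a: n / len(residues) for a, n in counts.items()})
    return emissions

def match_exit_transitions(rows, j):
    """Frequencies of m_j -> m_{j+1} and m_j -> d_{j+1} (j is 1-based)."""
    to_match = to_delete = 0
    for row in rows:
        if row[j - 1] == "-":
            continue              # this sequence passes through d_j, not m_j
        if row[j] == "-":
            to_delete += 1
        else:
            to_match += 1
    total = to_match + to_delete
    return {"m": to_match / total, "d": to_delete / total}

rows = ["CAFTPA", "CKTTPA", "CA-TPD", "CAF--D"]
E = match_emissions(rows)
print(E[1])                              # emissions of m2
print(E[2])                              # emissions of m3
print(match_exit_transitions(rows, 2))   # transitions out of m2
```

In practice the raw counts would be smoothed with pseudocounts (the ε terms above) so that a residue absent from the training alignment does not force a query's probability to zero.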

Example
Consider the alignment:
CAFTPA
CKTTPA
CA-TPD
CAF--D
And the query: CDAFPD
Then the most probable path through the model would be:
start, m1, i1, m2, m3, d4, m5, m6, end
Which leads to the alignment:
C-AFTPA
C-KTTPA
C-A-TPD
C-AF--D
CDAF-PD
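How a state path turns into an alignment can be made concrete with a small sketch. It takes the path as given (it does not run Viterbi), assumes every column of the existing alignment is a match column, and encodes states as hypothetical `(kind, index)` pairs with the start and end states omitted.

```python
def add_query_by_path(rows, query, path):
    """Extend an alignment with a query, following a profile-HMM state path.

    path: list of (kind, j) with kind in {"m", "i", "d"}; m_j and d_j consume
    alignment column j, while i_j opens a new gap column in the existing rows.
    """
    columns = list(zip(*rows))
    out = [[] for _ in rows]
    out_query, pos = [], 0
    for kind, j in path:
        if kind == "i":                       # insertion: query residue, gaps elsewhere
            for r in out:
                r.append("-")
            out_query.append(query[pos]); pos += 1
        else:
            for r, ch in zip(out, columns[j - 1]):
                r.append(ch)
            if kind == "m":                   # match: query residue joins column j
                out_query.append(query[pos]); pos += 1
            else:                             # delete: query skips column j
                out_query.append("-")
    if pos != len(query):
        raise ValueError("path does not emit the whole query")
    return ["".join(r) for r in out] + ["".join(out_query)]

rows = ["CAFTPA", "CKTTPA", "CA-TPD", "CAF--D"]
path = [("m", 1), ("i", 1), ("m", 2), ("m", 3), ("d", 4), ("m", 5), ("m", 6)]
aligned = add_query_by_path(rows, "CDAFPD", path)
print("\n".join(aligned))
```

The insert state i1 is what introduces the new all-gap column in the four original rows, and the delete state d4 is what puts the gap in the query row.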

Pfam
http://www.sanger.ac.uk/pfam/
A web-based resource and platform maintained by the Sanger Center that uses the above theory to classify proteins and/or to determine domains in given query protein sequences.

The Cystic Fibrosis gene
- Cystic fibrosis (CF) is a recessive genetic disease caused by a defect in a single gene, the one coding for CFTR.
- It causes the body to produce abnormally thick mucus that clogs the lungs and the pancreas, often leading to very early death.
- The cystic fibrosis transmembrane conductance regulator (CFTR) gene and its role in CF were identified in 1989 [Riordan et al., Science 1989; Kerem et al., Science 1989].
- The CFTR gene resides at Chr7 q31.2. It is 230,000 bp long and encodes a protein of 1,480 amino acids.
- The most common mutation is called ΔF508: a deletion of a phenylalanine (F) at position 508 in the CFTR protein.
- In the United States, approximately 30,000 individuals have CF. 1 in every 25 people of European descent is a carrier of some potentially limiting CFTR mutation.

The Cystic Fibrosis protein: What does it do?

CFTR: two important domains
Two key features of the protein are evidenced in the MSA (and supported by other analyses and prior knowledge of the aligned proteins):
- Membrane-spanning domains
- ATP-binding motifs
These features indicated that CFTR is likely to be involved in transporting ions across the cell membrane. This is consistent with the association of CF with cellular salt transport and with how defects in this mechanism result in thicker mucus.