Multiple Sequence Alignment: HMMs and Other Approaches. Background readings: Durbin et al., Section 3.1; Ewens and Grant, Ch. 4; Wing-Kin Sung, Ch. 6; Beerenwinkel N, Siebourg J. Statistics, probability, and computational science. In M. Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods. Springer, to appear. Section 6. Prepared by Zohar Yakhini (Technion) using material from slides by Colin Dewey (U of Wisconsin).
Multiple Sequence Alignment: Task Definition. Given: a set of more than 2 sequences; a method for scoring an alignment. Do: determine the correspondences between the sequences such that the alignment score is maximized.
Multiple Sequence Alignment
Related Task: Profile Alignment. Given: a query sequence $s$; a database of profiles $P_i$, $i = 1, \dots, k$ (based on multiple alignments of related sequences). Do: compute an alignment score $\Phi(s, P_i)$ for each of the profiles; determine the best-matching profile for the query sequence.
Scoring an alignment. How do we assess the quality of a given alignment? We will work with column scores: $S = \sum_j S_j$. Sum of pairs: $S_j = \sum_{k<l} SM(m_j^k, m_j^l)$, where $m_j^k$ is the character used in the $j$-th column for the $k$-th aligned sequence, and $SM$ is some substitution scoring matrix operating on the relevant characters. Minimum entropy: $S_j$ is the frequency-based entropy of the $j$-th column (lower entropy means a more conserved column).
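The two column scores above can be sketched in a few lines. This is an illustrative sketch only: the function names, the toy match/mismatch/gap values standing in for $SM$, and the choice to count the gap character as a symbol in the entropy are all assumptions, not part of the original slides.

```python
import math
from collections import Counter
from itertools import combinations

def sm(a, b, match=1, mismatch=-1, gap=-2):
    """Toy substitution score (assumed values; real work would use e.g. BLOSUM62)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs_column(column):
    """S_j = sum over k < l of SM(m_j^k, m_j^l)."""
    return sum(sm(a, b) for a, b in combinations(column, 2))

def entropy_column(column):
    """Frequency-based entropy of one column, in bits.
    Assumption: the gap character '-' is counted like any other symbol."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def alignment_score(rows, column_score):
    """S = sum over columns j of S_j."""
    return sum(column_score(col) for col in zip(*rows))

rows = ["CAFTPA", "CKTTPA", "CA-TPD", "CAF--D"]
sp = alignment_score(rows, sum_of_pairs_column)
ent = alignment_score(rows, entropy_column)
print("sum of pairs:", sp)
print("total entropy:", round(ent, 3))
```

Note the opposite directions: a better alignment has a higher sum-of-pairs score but a lower total entropy.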
Dynamic Programming
Time complexity: exponential in the number of sequences. For $k$ sequences of length $n$: when using sum of pairs, $O(k^2 2^k n^k)$; when using entropy, $O(k 2^k n^k)$.
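To see where the exponential terms come from, one can count the work directly: the DP table has roughly $n^k$ cells, each cell has $2^k - 1$ predecessor cells, and each predecessor requires one column score. The sketch below is a back-of-the-envelope count under those assumptions; `dp_work` is a hypothetical helper name, not something from the slides.

```python
def dp_work(n, k, column_cost):
    """Rough operation count for exact k-way dynamic programming."""
    cells = (n + 1) ** k        # one cell per combination of prefix lengths
    predecessors = 2 ** k - 1   # each sequence advances or takes a gap, but not all gaps
    return cells * predecessors * column_cost

for k in (2, 3, 5, 8):
    sp = dp_work(100, k, column_cost=k * (k - 1) // 2)  # sum of pairs: one lookup per pair
    ent = dp_work(100, k, column_cost=k)                # entropy: one pass over k characters
    print(f"k={k}: sum of pairs ~{sp:.2e}, entropy ~{ent:.2e}")
```

Even for sequences of length 100, the count grows by roughly two orders of magnitude with each added sequence, which is why exact DP is impractical beyond a handful of sequences.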
Heuristic approaches. Since the time complexity of the DP approach is exponential in the number of sequences, heuristic methods are usually used. Progressive alignment: construct a succession of pairwise alignments, using either a star approach or tree approaches (like CLUSTALW; Thompson et al. 1994). Iterative refinement: given a multiple alignment (say, from a progressive method), remove a sequence and realign it to the profile of the other sequences; repeat until convergence.
Star-Shaped Alignment. Given: $k$ sequences to be aligned, $x_1, \dots, x_k$. Pick one sequence $x_c$ as the center. For each $x_i \neq x_c$, determine an optimal pairwise alignment with the center. Merge the pairwise alignments. Return the multiple alignment resulting from the aggregation.
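The steps above can be sketched as follows. This is a minimal sketch under stated assumptions: pairwise alignments use Needleman-Wunsch with toy match/mismatch/gap scores, the merge follows the usual "once a gap, always a gap" rule, insertions from different sequences at the same center position share columns, and all function names are hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    sub = lambda i, j: match if a[i - 1] == b[j - 1] else mismatch
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace back one optimal path
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

def merge(msa, center_aln, other_aln):
    """Add one pairwise alignment to msa, whose first row is the gapped center."""
    merged, new_row = [[] for _ in msa], []
    i = j = 0  # i walks msa columns, j walks pairwise columns
    while i < len(msa[0]) or j < len(center_aln):
        msa_gap = i < len(msa[0]) and msa[0][i] == "-"
        pair_gap = j < len(center_aln) and center_aln[j] == "-"
        if msa_gap and not pair_gap:      # column known only to the msa
            for r, row in enumerate(msa):
                merged[r].append(row[i])
            new_row.append("-"); i += 1
        elif pair_gap and not msa_gap:    # new insertion relative to the center
            for r in range(len(msa)):
                merged[r].append("-")
            new_row.append(other_aln[j]); j += 1
        else:                             # same center residue, or gap in both
            for r, row in enumerate(msa):
                merged[r].append(row[i])
            new_row.append(other_aln[j]); i += 1; j += 1
    return ["".join(r) for r in merged] + ["".join(new_row)]

def star_align(seqs, center_index=0):
    """Align every sequence to the center, merge, and return rows in input order."""
    msa, order = [seqs[center_index]], [center_index]
    for idx, s in enumerate(seqs):
        if idx != center_index:
            _, c_aln, s_aln = global_align(seqs[center_index], s)
            msa = merge(msa, c_aln, s_aln)
            order.append(idx)
    result = [None] * len(seqs)
    for row, idx in zip(msa, order):
        result[idx] = row
    return result

seqs = ["CAFTPA", "CKTTPA", "CATPD", "CAFD"]
aligned = star_align(seqs, center_index=0)
print("\n".join(aligned))
```

The exact gap placement depends on the scoring values and on how ties are broken in the traceback, so the printed alignment is one of possibly several equally scored results.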
Star-Shaped Alignment: Example
Star-Shaped Alignment: Example. The merging stage
Star-Shaped Alignment: Example. The merging stage - cont
Star-Shaped Alignment: Picking the center. Two options: try all sequences as centers and return the best resulting alignment; or select as center the sequence $x_c$ that maximizes $\sum_{x_i \neq x_c} \Phi(x_i, x_c)$. The SP distance score resulting from either approach is at most twice the SP distance score of the optimal alignment (Gusfield 1993, Bafna et al. 1997).
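The second option, choosing the center by summed pairwise scores, can be sketched as below. As an assumption, $\Phi$ is taken to be a Needleman-Wunsch global alignment score with toy match/mismatch/gap values; `pairwise_score` and `pick_center` are hypothetical names.

```python
def pairwise_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch), keeping only one DP row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def pick_center(seqs):
    """Return (index of the sequence with the largest summed score, all sums)."""
    totals = [
        sum(pairwise_score(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]
    best = max(range(len(seqs)), key=totals.__getitem__)
    return best, totals

center, totals = pick_center(["AAAA", "AAAT", "TTTT"])
print(center, totals)  # 1 [-2, 0, -6]
```

This needs all $k(k-1)/2$ pairwise alignments but only one merge, whereas trying every sequence as center repeats the merge $k$ times.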
Aligning a query sequence to a profile. Use existing knowledge about a family of proteins to produce an HMM for the family. Determine the fitness of any given query to any family (Viterbi and/or the Forward algorithm). Determine the most fit family for a given query, amongst several possibilities. Possibly add the query to the family using an alignment determined by a Viterbi path.
HMM Profile of an alignment
HMM Profiles. The HMM Profile Graph, as above, represents the transition matrix and the emission distributions of an HMM that describes a protein family. The model has a length; it is 3 in the example above. Parameters are inferred from a given alignment, for example by frequency counting.
Example. Consider:
    CAFTPA
    CKTTPA
    CA-TPD
    CAF--D
Then for a model of length 6 we have: M(start,i0) = M(start,d1) = ε; M(start,m1) = 1-2ε. E(m1,C) is close to 1 and all other amino acids get ε/19, say. M most likely takes m1 to m2. E(m2,A) ~ 0.75; E(m2,K) ~ 0.25; other amino acids get ε fractions. M(m2,m3) ~ 0.75; M(m2,d3) ~ 0.25. E(m3,F) ~ 0.66; E(m3,T) ~ 0.33. Etc. M(*6,end) = 1, of course.
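The frequency counting behind these numbers can be sketched as follows. This is a minimal sketch: it assumes every alignment column becomes a match state (model length 6, as in the example), uses an optional pseudocount in place of the ε terms, and the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def emission_probs(column, pseudocount=0.0):
    """Estimate E(m_j, a) from the residues (non-gap characters) in one column."""
    residues = [c for c in column if c != "-"]
    counts = Counter(residues)
    total = len(residues) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def match_transitions(rows, j):
    """Estimate M(m_j, m_{j+1}) and M(m_j, d_{j+1}) for 1-based column j.
    Only rows that have a residue in column j are counted."""
    in_match = [r for r in rows if r[j - 1] != "-"]
    to_match = sum(1 for r in in_match if r[j] != "-")
    n = len(in_match)
    return {"match": to_match / n, "delete": (n - to_match) / n}

rows = ["CAFTPA", "CKTTPA", "CA-TPD", "CAF--D"]
columns = ["".join(col) for col in zip(*rows)]

e2 = emission_probs(columns[1])   # column 2: A, K, A, A
e3 = emission_probs(columns[2])   # column 3: F, T, -, F
t2 = match_transitions(rows, 2)   # transitions out of m2
print(e2["A"], e2["K"])           # 0.75 0.25
print(round(e3["F"], 2), round(e3["T"], 2))
print(t2)                         # {'match': 0.75, 'delete': 0.25}
```

With `pseudocount=0.0` unseen amino acids get probability exactly zero; a small positive pseudocount plays the role of the ε fractions in the example.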
Example. Consider:
    CAFTPA
    CKTTPA
    CA-TPD
    CAF--D
And the query: CDAFPD. Then the most probable path through the model would be: start, m1, i1, m2, m3, d4, m5, m6, end. Which leads to the alignment:
    C-AFTPA
    C-KTTPA
    C-A-TPD
    C-AF--D
    CDAF-PD
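How a state path translates into an extended alignment can be sketched directly. The sketch below does not run Viterbi; it takes the path as given and assumes the existing alignment has exactly one column per match state. The function name and the string encoding of states (`m1`, `i1`, `d4`) are illustrative choices.

```python
def add_query_along_path(rows, query, path):
    """Extend an alignment with a query, following a profile-HMM state path.
    Match states emit a query residue into an existing column, delete states
    put a gap there, and insert states open a new column that is a gap in
    every existing row."""
    new_rows = [[] for _ in rows]
    new_query = []
    q = 0  # index of the next unread query residue
    for state in path:
        if state in ("start", "end"):
            continue
        kind, pos = state[0], int(state[1:])
        if kind == "i":
            for r in new_rows:
                r.append("-")
            new_query.append(query[q]); q += 1
        else:
            for r, row in zip(new_rows, rows):
                r.append(row[pos - 1])
            if kind == "m":
                new_query.append(query[q]); q += 1
            else:  # kind == "d"
                new_query.append("-")
    return ["".join(r) for r in new_rows] + ["".join(new_query)]

rows = ["CAFTPA", "CKTTPA", "CA-TPD", "CAF--D"]
path = ["start", "m1", "i1", "m2", "m3", "d4", "m5", "m6", "end"]
result = add_query_along_path(rows, "CDAFPD", path)
print("\n".join(result))
```

Run on the path above, this reproduces the five-row alignment shown in the example.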
Pfam (http://www.sanger.ac.uk/pfam/). A web-based resource and platform maintained by the Sanger Center that uses the above theory to classify proteins and/or to determine domains in given query protein sequences.
The Cystic Fibrosis gene. Cystic Fibrosis (CF) is a recessive genetic disease caused by a defect in a single gene, the one coding for CFTR. It causes the body to produce abnormally thick mucus that clogs the lungs and the pancreas, often leading to very early death. The cystic fibrosis transmembrane conductance regulator (CFTR) gene and its role in CF were identified in 1989 [Riordan et al., Science 1989; Kerem et al., Science 1989]. The CFTR gene resides at Chr7 q31.2. It is 230,000 bp long and codes for a protein with 1,480 amino acids. The most common mutation is called ΔF508: a deletion of a phenylalanine (F) at position 508 in the CFTR protein. In the United States, approximately 30,000 individuals have CF. 1 in every 25 people of European descent is a carrier of some potentially limiting CFTR mutation.
The Cystic Fibrosis protein: What does it do?
CFTR: two important domains. Two key features of the protein are evidenced in the MSA (and based on other analyses and prior knowledge of the aligned proteins):
    o Membrane-spanning domains
    o ATP-binding motifs
These features indicated that CFTR is likely to be involved in transporting ions across the cell membrane. This is consistent with the association of CF with cellular salt transport and with how defects in this mechanism result in thicker mucus.