Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

1 Bioinformatics: In-depth PROBABILITY & STATISTICS. Spring Semester 2011, University of Zürich and ETH Zürich. Dr. Stefanie Muff. Adapted from a course by Dr. Dominic Schumacher.

2 Until now, we modeled the development over different positions in one DNA sequence (time = position in the sequence). Today, we use Markov chains to model the development of individual positions in a DNA or protein sequence over time (time = time over which the sequence evolves).

3 Objectives for today: Studying models for sequence evolution → transition matrices. Deriving matrices that give a realistic score to each possible substitution in a sequence (DNA or protein) → scoring matrices.

4 Models for sequence evolution (DNA): Each site of the DNA sequence evolves according to a first-order Markov chain with state space {A, C, G, T}.

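5 Simplest model for DNA evolution: Jukes-Cantor (1969)

$$P = \begin{pmatrix} p_{a,a} & p_{a,c} & p_{a,g} & p_{a,t} \\ p_{c,a} & p_{c,c} & p_{c,g} & p_{c,t} \\ p_{g,a} & p_{g,c} & p_{g,g} & p_{g,t} \\ p_{t,a} & p_{t,c} & p_{t,g} & p_{t,t} \end{pmatrix} = \begin{pmatrix} 1-3\alpha & \alpha & \alpha & \alpha \\ \alpha & 1-3\alpha & \alpha & \alpha \\ \alpha & \alpha & 1-3\alpha & \alpha \\ \alpha & \alpha & \alpha & 1-3\alpha \end{pmatrix}$$

The stationary distribution is π = (0.25, 0.25, 0.25, 0.25). Necessary: α < 1/3. The parameter α depends on the time scale: e.g., if the unit time is 100,000 generations, α takes a smaller value than if the unit time were chosen as 200,000 generations.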
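The two claims on this slide (the uniform stationary distribution and the dependence of α on the time unit) can be checked numerically. The following is a minimal sketch in plain Python; the helper names and the value α = 0.01 are illustrative assumptions, not part of the lecture. It uses the fact that two steps of a Jukes-Cantor chain again form a Jukes-Cantor chain, with the larger parameter 2α − 4α².

```python
def jukes_cantor(alpha):
    """4x4 Jukes-Cantor transition matrix over the states (a, c, g, t)."""
    if not 0 < alpha < 1 / 3:
        raise ValueError("alpha must lie in (0, 1/3)")
    return [[1 - 3 * alpha if j == k else alpha for k in range(4)]
            for j in range(4)]


def mat_mul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][m] * Y[m][k] for m in range(n)) for k in range(n)]
            for i in range(n)]


def vec_mat(v, X):
    """Row vector times matrix."""
    return [sum(v[j] * X[j][k] for j in range(len(v)))
            for k in range(len(X[0]))]


alpha = 0.01                 # illustrative value only
P = jukes_cantor(alpha)
pi = [0.25] * 4

pi_next = vec_mat(pi, P)     # pi P, should equal pi again
P2 = mat_mul(P, P)           # transition matrix for a doubled time unit
alpha_two_steps = P2[0][1]   # off-diagonal entry of P^2: 2*alpha - 4*alpha**2
```

Doubling the time unit gives a value slightly below 2α, because a site can mutate and mutate back within the longer interval.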

6 Objection: The Jukes-Cantor model is not entirely realistic (e.g. all kinds of substitutions are equally likely).

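7 More realistic model: The Kimura model (1980), with states ordered (a, c, g, t):

$$P = \begin{pmatrix} 1-\alpha-2\beta & \beta & \alpha & \beta \\ \beta & 1-\alpha-2\beta & \beta & \alpha \\ \alpha & \beta & 1-\alpha-2\beta & \beta \\ \beta & \alpha & \beta & 1-\alpha-2\beta \end{pmatrix}$$

The stationary distribution is π = (0.25, 0.25, 0.25, 0.25). Necessary: α + 2β < 1. α = probability of a transition (purine to purine or pyrimidine to pyrimidine). β = probability of a transversion (purine to pyrimidine or vice versa). (Purines: a, g; pyrimidines: c, t.)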
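As a small illustration of how the Kimura matrix is built from the purine/pyrimidine classification, here is a sketch in plain Python. The function name and the parameter values are assumptions chosen for the example only.

```python
STATES = "acgt"
PURINES = {"a", "g"}          # the pyrimidines are c and t


def kimura(alpha, beta):
    """Kimura transition probabilities as a dict keyed by (from, to)."""
    if alpha < 0 or beta < 0 or alpha + 2 * beta >= 1:
        raise ValueError("need alpha, beta >= 0 and alpha + 2*beta < 1")
    P = {}
    for x in STATES:
        for y in STATES:
            if x == y:
                P[x, y] = 1 - alpha - 2 * beta      # no change
            elif (x in PURINES) == (y in PURINES):
                P[x, y] = alpha                     # transition
            else:
                P[x, y] = beta                      # transversion
    return P


P = kimura(alpha=0.02, beta=0.005)   # illustrative values only

row_sums = {x: sum(P[x, y] for y in STATES) for x in STATES}
# One step starting from the uniform distribution:
pi_next = {y: sum(0.25 * P[x, y] for x in STATES) for y in STATES}
```

Every row sums to 1 and the uniform distribution is reproduced after one step, in line with the stationary distribution stated on the slide.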

8 Comments: There are even more realistic Markov models for DNA substitution (e.g. Kimura 3ST, Felsenstein, Hasegawa-Kishino-Yano, and many others). Most models assume that sites evolve independently (which is not entirely realistic). Why not use more realistic models? Because they lack nice properties, e.g., reversibility and stationary distributions. For more sophisticated models, it is more complicated to compute the probabilities of interest, and there are more parameters to estimate. The simpler models seem to give reasonably sensible results.

9 Evolutionary models for proteins The concept is similar, but here it is much more important to account for the wide range of different transition probabilities associated with amino acid substitutions. From: Taylor W.R. (1986) The classification of amino acid conservation: a strategy for the hierarchical analysis of residue conservation, Bioinformatics, 9:745-56 See the construction of substitution scoring matrices!

10 Substitution scoring matrices

11 Scoring systems: Important in bioinformatics for comparing DNA or protein sequences. Aim: inferring the function of molecules by finding similarity to a sequence with known function. This requires a good alignment of the sequences, which in turn requires a measure for judging the quality of an alignment in relation to other possible alignments: a scoring system.

12 We use additive scoring systems: Assign a quality score to each aligned pair and sum up the scores of the individual positions (ignoring gaps for now). Example: Two DNA sequences; score for a match: +1, score for a mismatch: −1. E.g.:

   a   a   g   t   t   t   c   t   t   g
   a   a   a   c   t   c   c   c   t   g
  +1  +1  −1  −1  +1  −1  +1  −1  +1  +1    (individual scores)

Cumulative score: 6 − 4 = 2. Maybe more realistic: score for a match: +1, score for a transition: −1/2, score for a transversion: −1. All four mismatches in this example are transitions, so the cumulative score in that case is 6 − 2 = 4.
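The additive scheme is easy to reproduce in code. The sketch below (plain Python, with hypothetical function names) scores the two example sequences from this slide under both scoring rules.

```python
PURINES = {"a", "g"}


def site_score(x, y, match=1.0, transition=-1.0, transversion=-1.0):
    """Score of one aligned pair of nucleotides."""
    if x == y:
        return match
    if (x in PURINES) == (y in PURINES):
        return transition        # purine<->purine or pyrimidine<->pyrimidine
    return transversion


def alignment_score(s, t, **scores):
    """Additive score of an ungapped alignment of two equal-length sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(site_score(x, y, **scores) for x, y in zip(s, t))


s = "aagtttcttg"
t = "aaactccctg"

simple = alignment_score(s, t)                     # match +1, mismatch -1
refined = alignment_score(s, t, transition=-0.5)   # transitions penalised less
```

With these inputs, `simple` evaluates to 2.0 and `refined` to 4.0, matching the two cumulative scores on the slide.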

13 Scoring matrices: The scores for the individual positions can be displayed in a so-called scoring matrix (also called substitution matrix). This is usually a symmetric 4 × 4 (DNA) or 20 × 20 (protein) matrix whose entry (i, j) is the score that we assign if the nucleotides (or amino acids) i and j are aligned at a position. E.g. for the second example from the last slide:

$$S = \begin{pmatrix} s_{a,a} & s_{a,c} & s_{a,g} & s_{a,t} \\ s_{c,a} & s_{c,c} & s_{c,g} & s_{c,t} \\ s_{g,a} & s_{g,c} & s_{g,g} & s_{g,t} \\ s_{t,a} & s_{t,c} & s_{t,g} & s_{t,t} \end{pmatrix} = \begin{pmatrix} 1 & -1 & -1/2 & -1 \\ -1 & 1 & -1 & -1/2 \\ -1/2 & -1 & 1 & -1 \\ -1 & -1/2 & -1 & 1 \end{pmatrix}$$

14 How can we find a biologically sensible scoring matrix? For DNA sequences: simple scoring matrices (like the one presented) are often effective. Usually, no need to worry. For protein sequences: some substitutions are clearly more likely to occur than others (presumably due to similar chemical properties of the amino acids involved); e.g. isoleucine for valine, serine for threonine, so-called conservative substitutions. We get considerably better alignments if we take this into account. Use scoring matrices that are derived by statistical analysis of protein data.

15 Scoring matrices for proteins Specifications: Identical amino acids should be given a higher score than any substitution. Conservative substitutions should be given a higher score than non-conservative ones (e.g., substitution between hydrophobic or charged amino acids). We want our scoring matrices to take into account the evolutionary distance between the sequences involved!

16 Generally: substitutions with low transition probability should be given a lower score than those that occur very frequently (like self-transitions). In fact: Transition matrix → log-likelihood ratios → Scoring matrix.

17 Approaches to find scoring matrices: 1) The PAM family of substitution matrices: uses Markov chains and phylogenetic trees to fit an evolutionary model, and log-likelihood ratios to construct a scoring matrix from the estimated transition matrix. 2) The BLOSUM family of substitution matrices: uses log-likelihood ratios to construct a scoring matrix from substitution frequencies estimated directly from blocks of aligned sequences. 3) WAG and WAG* substitution matrices: combine the estimation of transition and scoring matrices in a maximum-likelihood approach. This is the most recent of the three methods and is nowadays widely used.

18 The PAM family of scoring matrices (Dayhoff, Schwartz, and Orcutt, 1978): It requires the use of Markov chains and phylogenetic trees (for fitting an evolutionary model) and of log-likelihood ratios (for getting a scoring matrix from an estimated transition matrix). PAM = Point (or Percentage) Accepted Mutations. Accepted mutations = those mutations that are able to spread in the population and become dominant (typically mutations that do not disrupt the protein function or that even increase the fitness of the species).

19 PAM matrices (Dayhoff matrices): There are two types of PAM matrices: a PAM Markov transition matrix P (= the table of estimated transition probabilities for the underlying evolutionary model), and a PAM scoring matrix (= the table of scores for all possible pairs of amino acids, which is used to judge the quality of a given alignment). Again: Transition matrix → Scoring matrix.

20 Underlying model: Each site in the sequence evolves according to a reversible Markov chain, and independently of the other sites. All the Markov chains have the same transition matrix P (of dimension 20 × 20). Dayhoff et al. (1978) estimated the one-step transition matrix P from protein sequence data. How...?

21 Construction of a PAM1 transition matrix Definition: A PAM1 transition matrix is the Markov transition matrix applying for a time period over which we expect 1% of the amino acids to undergo accepted point mutations. The steps involved in the estimation: 1. Find reliable data and align protein sequences that are at least 85% identical. 2. Reconstruct phylogenetic trees and infer ancestral sequences. 3. Count the amino acid replacements that occurred along the trees (i.e. count mutations accepted by natural selection). 4. Use these counts to estimate the Markov transition probabilities between amino acids.

22 Step 1: Find reliable data. Dayhoff et al. (1978) used ungapped multiple alignments of certain well-conserved regions from closely related proteins (71 blocks of proteins from 34 families, 1572 changes in all). [Figure: a block of aligned protein sequences with more than 85% identity.] In any block, no two sequences differed by more than 15%. (The idea was to keep the number of sites that have undergone several changes low.) Note: Only a limited amount of protein sequence data was available at that time!

23 Step 2: Reconstruct phylogenetic trees. The aligned regions were then used to infer the underlying evolutionary tree(s) and the ancestral sequences. For this, most parsimonious trees were used. A most parsimonious tree is a tree structure such that the total number of substitutions across the tree is minimal. The protein sequences of one block are the leaves of a tree. [Figure: example data seq1 = AA, seq2 = AE, seq3 = EE, with candidate trees whose branches are labeled with the number of substitutions (0 or 1).]

24 Why do we use trees? To avoid overcounting! In a tree, the sequences are grouped in the right way (in general, very similar sequences succeed one another in the tree). As a result, we mainly count transitions between similar sequences and only a few transitions to other, more different sequences, so the corresponding substitutions do not receive an unnatural weight.

25 Step 3: Count the replacements along the trees. Example: There are five most parsimonious trees for the three sequences AA, AE and EE. [Figure: the five most parsimonious trees, each branch labeled with 0 or 1 substitutions.] Count all substitutions along the branches (example: 1 indicates an A-E alignment).

26 The substitution A-E (and E-A) occurs exactly twice in each tree. The self-alignment A-A (and E-E) occurs a total of 15 times in the trees. Self-alignments are counted twice, thus we get 30. Divide by the number of trees (5) to get an average of 6. In summary we get the count matrix

$$A = \begin{pmatrix} A_{AA} & A_{AE} \\ A_{EA} & A_{EE} \end{pmatrix} = \begin{pmatrix} 6 & 2 \\ 2 & 6 \end{pmatrix}$$

27 Now, we do not have only two letters A and E, but 20 amino acids. They are numbered from 1 to 20, and transitions between them are counted just as in the example above. Here, let $A_{jk}$ be the average number of times substitutions from j to k were observed in the trees:

$$A = \begin{pmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,20} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ A_{20,1} & A_{20,2} & \cdots & A_{20,20} \end{pmatrix}$$

The counts can be summed over different blocks of sequences.

28 Step 4: Estimate the Markov transition probabilities. The estimated probabilities for j → k are the observed relative frequencies:

$$a_{jk} := \frac{A_{jk}}{\sum_{m=1}^{20} A_{j,m}}.$$

Remember: PAM1 is a transition matrix under which 1% of the amino acids are expected to undergo an accepted point mutation. Thus, to get $P = (p_{jk})$, the $a_{jk}$ have to be scaled by a factor c:

$$p_{jk} := c\, a_{jk} \ \text{ for } j \neq k, \qquad p_{j,j} := 1 - \sum_{k \neq j} c\, a_{jk},$$

where the scaling constant c is sufficiently small so that $p_{j,j} \geq 0$ for all j.

29 The scaling factor c: The factor c enforces the 1% of accepted point mutations. This is useful for relatively short evolutionary distances. Such a time unit is called an evolutionary distance of 1 PAM. (Note: 1 PAM can be 1 million years, but also much more or less, depending on the protein family.)

30 The determination of c: Let $Z_n$ = the amino acid present at a particular site at time n (hence $1 \leq Z_n \leq 20$). The probability that the site will change after 1 PAM time unit (i.e. after one step) is given by

$$P(Z_{n+1} \neq Z_n) = \sum_{j=1}^{20} \sum_{k \neq j} q_j\, p_{jk},$$

where $q_j$ is the observed frequency of the amino acid j in the original blocks of aligned proteins.

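31 One wants the probability that the site will change after 1 PAM to be equal to 0.01. (That implies an average change of 1%.)

$$0.01 = \sum_{j=1}^{20} \sum_{k \neq j} q_j\, p_{jk} = \sum_{j=1}^{20} \sum_{k \neq j} q_j\, c\, a_{jk} = c \sum_{j=1}^{20} \sum_{k \neq j} q_j\, a_{jk} \quad\Longrightarrow\quad c = \frac{0.01}{\sum_{j=1}^{20} \sum_{k \neq j} q_j\, a_{jk}}$$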
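Step 4 and the choice of c can be sketched in a few lines of plain Python. The example reuses the two-letter count matrix with entries 6, 2, 2, 6 from the tree example; the frequencies q = (0.5, 0.5) and the function name are assumptions made for illustration only.

```python
def pam1_from_counts(A, q, target=0.01):
    """Scale relative substitution frequencies to a PAM1 transition matrix.

    A is the matrix of averaged counts, q the amino acid frequencies, and
    target the desired probability of change per step (1% for PAM1).
    """
    n = len(A)
    # a_jk: observed relative frequencies (each row of A divided by its sum)
    a = [[A[j][k] / sum(A[j]) for k in range(n)] for j in range(n)]
    # sum_j sum_{k != j} q_j a_jk
    change = sum(q[j] * a[j][k]
                 for j in range(n) for k in range(n) if k != j)
    c = target / change
    P = [[c * a[j][k] if j != k else 0.0 for k in range(n)]
         for j in range(n)]
    for j in range(n):
        P[j][j] = 1.0 - sum(P[j][k] for k in range(n) if k != j)
    return P, c


A = [[6, 2], [2, 6]]     # count matrix of the two-letter example (A, E)
q = [0.5, 0.5]           # assumed letter frequencies, for illustration only

P, c = pam1_from_counts(A, q)
expected_change = sum(q[j] * P[j][k]
                      for j in range(2) for k in range(2) if k != j)
```

For this toy input the scaling constant comes out as c = 0.04, so each letter changes with probability 0.01 per step and stays with probability 0.99.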

32 How can the transition matrix be turned into a scoring matrix? Consider two given protein sequences $s = a_1 a_2 \cdots a_n$ and $s' = b_1 b_2 \cdots b_n$ at an evolutionary distance of 1 PAM. (Note: the evolutionary distance between sequences is difficult to determine exactly, but for closely related sequences even non-optimal matrices give good results.) The score for aligning s with s' is generated by comparing two different hypotheses $H_0$ and $H_1$. $H_0$: s and s' are not evolutionarily related. $H_1$: s and s' are evolutionarily related (i.e. s' depends on s via the Markov model).

33 Under $H_0$, we have a chance alignment of $s = a_1 a_2 \cdots a_n$ with $s' = b_1 b_2 \cdots b_n$. Suppose amino acid j appears with probability $q_j$. The probability of getting this chance alignment is equal to

$$P_{H_0}(\text{the alignment}) = \Bigl(\prod_{i=1}^{n} q_{a_i}\Bigr) \Bigl(\prod_{i=1}^{n} q_{b_i}\Bigr) = \prod_{i=1}^{n} \bigl(q_{a_i}\, q_{b_i}\bigr).$$

34 Under $H_1$, the sites in the sequences are dependent according to the Markov model, and thus the transition probabilities are those in the PAM1 matrix. Example: $P_{H_1}(\text{align P and R at a given site}) = q_P\, p_{PR}$. Since the different sites evolve independently of each other, we get

$$P_{H_1}(\text{the alignment}) = \prod_{i=1}^{n} \bigl(q_{a_i}\, p_{a_i b_i}\bigr).$$

35 We want our score to reflect the chance (or the odds) that with s and s' we have aligned evolutionarily related sequences (i.e. we want a high score if it is very likely that we have aligned related sequences). A natural choice for the score is the likelihood ratio (odds):

$$\text{Score} = \frac{P_{H_1}(\text{the alignment})}{P_{H_0}(\text{the alignment})} = \frac{\prod_{i=1}^{n} \bigl(q_{a_i}\, p_{a_i b_i}\bigr)}{\prod_{i=1}^{n} \bigl(q_{a_i}\, q_{b_i}\bigr)} = \prod_{i=1}^{n} \frac{q_{a_i}\, p_{a_i b_i}}{q_{a_i}\, q_{b_i}} = \prod_{i=1}^{n} \frac{p_{a_i b_i}}{q_{b_i}}.$$

36 Or, equivalently, one can use the log-likelihood ratio (the "log odds"). The score S of the alignment then becomes:

$$S = \log\Bigl(\frac{P_{H_1}(\text{the alignment})}{P_{H_0}(\text{the alignment})}\Bigr) = \log\Bigl(\prod_{i=1}^{n} \frac{p_{a_i b_i}}{q_{b_i}}\Bigr) = \sum_{i=1}^{n} \log\Bigl(\frac{p_{a_i b_i}}{q_{b_i}}\Bigr).$$

Advantage: the log turns products into sums → additive scoring scheme. The entry (a, b) in the PAM substitution matrix is then of the form

$$S_{ab} = \log\Bigl(\frac{p_{ab}}{q_b}\Bigr)$$

(usually rounded to the nearest integer for convenience).

37 Total score:

$$S(\text{alignment}) = \sum_{i=1}^{n} S_{a_i b_i}.$$

Moreover:

$$S_{ab} = \log\Bigl(\frac{p_{ab}}{q_b}\Bigr) < 0 \iff \frac{p_{ab}}{q_b} < 1 \iff q_a\, p_{ab} < q_a\, q_b$$

(i.e. $S_{ab} < 0$ if it is more likely to see a and b aligned against each other in a random alignment than to see a and b aligned at the evolutionary distance of 1 PAM). Otherwise, $S_{ab} = \log(p_{ab}/q_b) \geq 0$.
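The equivalence between the summed log-odds entries and the log of the full likelihood ratio can be checked on a toy case. This sketch in plain Python assumes a two-letter alphabet (A, E), a symmetric one-step matrix with 1% change, and frequencies q = (0.5, 0.5); all of these values and names are illustrative, not Dayhoff's data.

```python
import math


def log_odds_matrix(P, q, letters):
    """Entries S_ab = log(p_ab / q_b), keyed by the letter pair (a, b)."""
    return {(a, b): math.log(P[i][k] / q[k])
            for i, a in enumerate(letters)
            for k, b in enumerate(letters)}


def alignment_score(s, t, S):
    """Additive score: sum of S over the aligned pairs."""
    return sum(S[a, b] for a, b in zip(s, t))


letters = "AE"
P = [[0.99, 0.01],
     [0.01, 0.99]]        # assumed one-step transition matrix
q = [0.5, 0.5]            # assumed letter frequencies

S = log_odds_matrix(P, q, letters)
s, t = "AAE", "AEE"
score = alignment_score(s, t, S)

# The same quantity computed directly as the log of P_H1 / P_H0.
idx = {a: i for i, a in enumerate(letters)}
p_h1 = math.prod(q[idx[a]] * P[idx[a]][idx[b]] for a, b in zip(s, t))
p_h0 = math.prod(q[idx[a]] * q[idx[b]] for a, b in zip(s, t))
log_ratio = math.log(p_h1 / p_h0)
```

Here the entry for (A, A) is positive and the entry for (A, E) is negative, and `score` agrees with `log_ratio` up to floating-point error.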

38 PAMn substitution matrix: For sequences having an evolutionary distance of n PAM units. Careful: n PAM units does not mean that we expect n% of the amino acids to differ, because substitutions can occur at the same site many times! Let P be the 1 PAM transition matrix. As always with Markov chains, the n-step transition probabilities $p^{(n)}_{ab}$ are given as the entries of $P^n$. The scores are

$$S^{(n)}_{ab} = \log\Bigl(\frac{p^{(n)}_{ab}}{q_b}\Bigr).$$

Highest level: PAM250 (approx. 20% sequence similarity).
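The warning that n PAM units do not correspond to n% differing sites can be illustrated with matrix powers. The sketch below uses plain Python and an assumed two-letter PAM1-like matrix (1% change per step, q = (0.5, 0.5)); real PAM matrices are 20 × 20, so the numbers are purely illustrative.

```python
import math


def mat_mul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][m] * Y[m][k] for m in range(n)) for k in range(n)]
            for i in range(n)]


def mat_pow(P, n):
    """n-step transition matrix P^n by repeated multiplication."""
    size = len(P)
    result = [[1.0 if i == k else 0.0 for k in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


P1 = [[0.99, 0.01],
      [0.01, 0.99]]       # assumed toy 1-PAM matrix
q = [0.5, 0.5]            # assumed letter frequencies

P250 = mat_pow(P1, 250)

# Expected fraction of sites whose letter differs after 250 steps.
frac_diff = sum(q[j] * P250[j][k]
                for j in range(2) for k in range(2) if k != j)

# Log-odds scores at distance 250 PAM.
S250 = [[math.log(P250[j][k] / q[k]) for k in range(2)] for j in range(2)]
```

In this toy chain the fraction of differing sites stays below 50% even after 250 steps, since repeated substitutions at the same site partly cancel each other.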

39 PAM criticism: Assumption that each site has the same mutability. Assumption that sites evolve independently. Construction of phylogenetic trees (especially most parsimonious ones) is very difficult (finding the most parsimonious tree is an NP-hard problem). Matrices for greater evolutionary distances are extrapolated from the PAM1 matrix. Based on a small set of closely related proteins.

40 The BLOSUM family of substitution matrices (Henikoff and Henikoff, 1992). BLOSUM = BLOcks SUbstitution Matrices. Again the scores will be logarithms of likelihood ratios, but this time there are no evolutionary models, and hence no Markov chains and no trees involved. The likelihoods are obtained by the statistical analysis of blocks of aligned sequences.

41 Blocks of aligned sequences: About 3000 blocks (extracted from multiple sequence alignments) involving about 800 protein families were used to derive the BLOSUM matrices (H&H developed a program called Protomat for obtaining these blocks). A difference to the construction of PAM matrices: the kind of data that was used! H&H's data was far more extensive: several hundred groups of proteins, at least 2369 occurrences of any particular substitution. Different concept: Dayhoff et al. used data from closely related proteins and extrapolated (PAM1 → PAMn), whereas H&H directly used protein sequences regardless of their evolutionary distances. BLOSUM uses no evolutionary model (i.e., no need to order the sequences according to evolutionary distance, no trees).

42 Four sample blocks from H&H's Blocks database (one block per column, one sequence per row):

    WWYIR  CASILRKIYIYGPV  GVSRLRTAYGGRK  NRG
    WFYVR  CASILRHLYHRSPA  GVGSITKIYGGRK  RNG
    WYYVR  AAAVARHIYLRKTV  GVGRLRKVHGSTK  NRG
    WYFIR  AASICRHLYIRSPA  GIGSFEKIYGGRR  RRG
    WYYTR  AASIARKIYLRQGI  GVGGFQKIYGGRQ  RNG
    WFYKR  AASVARHIYMRKQV  GVGKLNKLYGGAK  SRG
    WFYKR  AASVARHIYMRKQV  GVGKLNKLYGGAK  SRG
    WYYVR  TASIARRLYVRSPT  GVDALRLVYGGSK  RRG
    WYYVR  TASVARRLYIRSPT  GVGALRRVYGGNK  RRG
    WFYTR  AASTARHLYLRGGA  GVGSMTKIYGGRQ  RNG
    WFYTR  AASTARHLYLRGGA  GVGSMTKIYGGRQ  RNG
    WWYVR  AAALLRRVYIDGPV  GVNSLRTHYGGKK  DRG

Each block stems from an ungapped multiple alignment of a relatively highly conserved region of a protein family.

43 Likelihoods: Count the proportion $p_a$ of times that the amino acid (AA) a occurs somewhere in any block, and the proportion $p_{ab}$ of times that the AA pair (a, b) (not necessarily distinct) occurs in the same column of any block. Note: there are a total of $N \binom{m}{2}$ pairs that have to be taken into account in all the blocks, where N is the number of columns in all blocks together and m is the number of rows in each block, if we assume that this number is the same for all blocks.

44 The likelihood ratios: $p_{ab}$ is the likelihood (estimated probability) for the substitution a ↔ b under the assumption that the sequences are related. $p_a p_b$ is the likelihood (estimated probability) for the substitution a → b (also for b → a) under the assumption that the sequences are not related. Thus, for a ≠ b the likelihood for the substitution a ↔ b is $2\, p_a p_b$. A scoring matrix is obtained by the same considerations as for the PAM matrix. Set the entry at position (a, b) as

$$S_{ab} := \begin{cases} 2 \log_2\Bigl(\dfrac{p_{ab}}{2\, p_a p_b}\Bigr) & \text{if } a \neq b, \\[2ex] 2 \log_2\Bigl(\dfrac{p_{ab}}{p_a p_b}\Bigr) & \text{if } a = b. \end{cases}$$

($2 \log_2$ is what H&H used; using log would give us essentially the same score.) This is not yet the BLOSUM we are looking for!
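The counting of letters and column pairs can be sketched as follows in plain Python. The three-row toy block and the function name are hypothetical; pairs are treated as unordered, which matches the factor 2 in the denominator for a ≠ b, and only pairs that actually occur in the block receive a score.

```python
from collections import Counter
from itertools import combinations
from math import comb, log2


def block_scores(block):
    """Scores 2*log2(observed / expected) for the pairs seen in one block."""
    m, N = len(block), len(block[0])     # m rows (sequences), N columns
    letter_counts = Counter()
    pair_counts = Counter()
    for column in zip(*block):
        letter_counts.update(column)
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1   # unordered pair
    total_pairs = N * comb(m, 2)
    p = {a: n / (N * m) for a, n in letter_counts.items()}
    S = {}
    for (a, b), n in pair_counts.items():
        p_ab = n / total_pairs
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        S[a, b] = 2 * log2(p_ab / expected)
    return S, total_pairs


block = ["AE",
         "AE",
         "AA"]            # hypothetical toy block: 3 sequences, 2 columns

S, total_pairs = block_scores(block)
```

In this toy block the identical pairs (A, A) and (E, E) obtain positive scores and the mixed pair (A, E) a negative one.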

45 The circularity problem → iteration: To obtain the initial blocks, a multiple alignment was done, and a substitution matrix is needed for that! Henikoff and Henikoff used a simple unit matrix for this first alignment (1 for a match, 0 for a mismatch). Then, with the first BLOSUM matrix they obtained by the above procedure, they constructed a second set of blocks and a second BLOSUM matrix. Then, with this second matrix, a third matrix was constructed. Only this third matrix is recommended for use. This gives a BLOSUM100 matrix, provided that we have eliminated all identical copies of sequences from the original blocks. The BLOSUM100 is not very useful!

46 BLOSUMx matrices: To take into account small or large evolutionary distances between sequences, different BLOSUM matrices are constructed by clustering those sequences in each block that are sufficiently close (by combining them and using a weighted contribution to the counting). The result is called a BLOSUMx matrix, where the number x determines what we mean by "sufficiently close": we cluster sequences in a block that have x% identity or more. BLOSUM80 is used to compare closely related sequences, while BLOSUM45 is suitable for diverged proteins. An average BLOSUM matrix that is often used is BLOSUM62.
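The clustering step can be illustrated with a short sketch in plain Python. It assumes single-linkage clustering (two sequences end up in the same cluster whenever a chain of pairs with at least x% identity connects them); this linkage rule, the toy sequences, and the function names are assumptions for the example. The weighted counting that follows the clustering is not shown.

```python
def percent_identity(s, t):
    """Percentage of positions at which two equal-length sequences agree."""
    return 100.0 * sum(x == y for x, y in zip(s, t)) / len(s)


def cluster_block(block, x):
    """Group the row indices of a block into clusters at threshold x percent."""
    parent = list(range(len(block)))

    def find(i):
        # Root of the cluster containing i (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            if percent_identity(block[i], block[j]) >= x:
                parent[find(i)] = find(j)     # merge the two clusters

    clusters = {}
    for i in range(len(block)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


block = ["AAAAA", "AAAAE", "EEEEE"]      # hypothetical toy block

clusters_80 = cluster_block(block, 80)    # first two rows are 80% identical
clusters_100 = cluster_block(block, 100)  # only identical rows are merged
```

With the toy block, a threshold of 80 merges the first two sequences into one cluster, while a threshold of 100 leaves all three separate. A lower x therefore merges more of the similar sequences, so the remaining counts are dominated by more distant pairs.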

47 Note: The numbers n and x in PAMn and BLOSUMx play opposite roles. Higher values of n and lower values of x both correspond to longer evolutionary distances!! The n counts the steps in the Markov chain used for the evolutionary model. The x tells up to which percentage of similarity between two sequences in a block the sequences are seen as different.

48 PAM vs. BLOSUM. Advantages of BLOSUM: Simpler model. Observation-based, more or less independent of other models and concepts (no Markov chain assumption, trees, or maximum parsimony needed). Tests suggest that BLOSUM matrices are generally superior to PAM matrices for detecting biological relationships (even if the same amounts of data are used). Advantages of PAM: Yields an explicit evolutionary model as a by-product. Helps to better understand biological relationships.

49 WAG and WAG* matrices (Whelan and Goldman, 2001): WAG/WAG* is the skilful combination of an approximate maximum likelihood (ML) method and a counting method, JTT (Jones, Taylor and Thornton 1992; essentially an improved PAM), that employs most parsimonious trees. Neither ML nor counting methods are satisfactory alone: ML is very expensive and can be applied to only 10-30 sequences at a time. Counting methods using maximum parsimony have essentially all the drawbacks of PAM.

50 Assumptions made in the WAG models: All amino acid sites evolve independently according to the same Markov process. The Markov process is stationary, time-homogeneous and reversible. Transition probabilities $p_{jk}$ between amino acids stay constant over near-optimal branch lengths and tree topologies (i.e., given that the real tree is known). JTT gives near-optimal trees (i.e., optimal branch lengths).

51 In the WAG models, the transition matrix P is optimized by maximum likelihood, under the assumption that the optimal tree topologies $T = (T_1, \ldots, T_n)$ are known for all aligned protein families $D = (D_1, \ldots, D_n)$. Then maximize the sum:

$$\log L = \sum_{i=1}^{n_{\text{families}}} \log\bigl(p \mid T_i, D_i\bigr)$$

Reference to the original WAG paper: S. Whelan, N. Goldman: A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach, Mol. Biol. Evol. 18(5):691-699, 2001.