Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)


1 Bioinformatics II: Probability and Statistics. Universität Zürich and ETH Zürich, Spring Semester 2009. Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM). Dr Fraser Daly, adapted from a course by Dr N Pétrélis.

2 Last time: Theory of Markov chains and applications in different models. Often we modeled the development over different positions in one DNA sequence (states: nucleotides, index: position in sequence). This time: Use Markov chains to model the development of individual positions in a DNA/protein sequence over time (states: nucleotides/amino acids, index: time).

3 Objectives for today: Study models for sequence evolution. Derive good substitution matrices, i.e. matrices that give a realistic score to each possible substitution in a DNA or protein sequence.

4 Model for DNA sequence evolution: Each site of the DNA sequence evolves according to a Markov chain with state space {a, c, g, t}. The chains for different sites are independent, and they all have the same transition probabilities.

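5 Simplest model for sequence evolution: Jukes-Cantor. With the states ordered (a, c, g, t), the one-step transition matrix is

\[
P = \begin{pmatrix}
p_{aa} & p_{ac} & p_{ag} & p_{at} \\
p_{ca} & p_{cc} & p_{cg} & p_{ct} \\
p_{ga} & p_{gc} & p_{gg} & p_{gt} \\
p_{ta} & p_{tc} & p_{tg} & p_{tt}
\end{pmatrix}
=
\begin{pmatrix}
\beta & \alpha & \alpha & \alpha \\
\alpha & \beta & \alpha & \alpha \\
\alpha & \alpha & \beta & \alpha \\
\alpha & \alpha & \alpha & \beta
\end{pmatrix},
\qquad \beta = 1 - 3\alpha.
\]

Stationary distribution: π = (0.25, 0.25, 0.25, 0.25). We need α < 1/3. The parameter α depends on the time scale: as the real time represented by one step increases, so does α.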

6 The n-step transition probabilities can be calculated:

\[
P(X_n = i \mid X_0 = i) = \tfrac{1}{4} + \tfrac{3}{4}(1 - 4\alpha)^n,
\qquad
P(X_n = j \mid X_0 = i) = \tfrac{1}{4} - \tfrac{1}{4}(1 - 4\alpha)^n,
\]

for i, j ∈ {a, c, g, t}, i ≠ j. The Jukes-Cantor model is not entirely realistic (all types of substitutions are equally likely to occur). A more complicated and more realistic model is the Kimura model...
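The n-step probabilities of the Jukes-Cantor chain can be checked numerically. The following is a minimal sketch, assuming the state order (a, c, g, t) and the closed form 1/4 + (3/4)(1 − 4α)^n for staying and 1/4 − (1/4)(1 − 4α)^n for moving to one particular other base. The function names and the values of α and n are illustrative choices, not part of the lecture.

```python
def jukes_cantor_matrix(alpha):
    """One-step Jukes-Cantor transition matrix over (a, c, g, t)."""
    if not 0 < alpha < 1 / 3:
        raise ValueError("alpha must lie strictly between 0 and 1/3")
    beta = 1 - 3 * alpha
    return [[beta if i == j else alpha for j in range(4)] for i in range(4)]


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(p, n):
    """n-th power of a square matrix by repeated multiplication (n >= 1)."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result


def closed_form(alpha, n):
    """Closed-form n-step probabilities: (same base, one given other base)."""
    decay = (1 - 4 * alpha) ** n
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


# Illustrative values only.
alpha, n = 0.01, 50
p_n = mat_pow(jukes_cantor_matrix(alpha), n)
same, different = closed_form(alpha, n)
print(p_n[0][0], same)       # the two numbers should agree
print(p_n[0][1], different)  # the two numbers should agree
```

As n grows, both probabilities tend to 1/4, the stationary value.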

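7 With the states ordered (a, c, g, t), the Kimura transition matrix is

\[
P = \begin{pmatrix}
1 - \alpha - 2\beta & \beta & \alpha & \beta \\
\beta & 1 - \alpha - 2\beta & \beta & \alpha \\
\alpha & \beta & 1 - \alpha - 2\beta & \beta \\
\beta & \alpha & \beta & 1 - \alpha - 2\beta
\end{pmatrix}.
\]

We need α + 2β < 1. Stationary distribution: π = (0.25, 0.25, 0.25, 0.25). α: probability of a transition (pyrimidine to pyrimidine or purine to purine). β: probability of a given transversion (purine to pyrimidine or pyrimidine to purine). (Purines: a, g; pyrimidines: c, t.)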
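The Kimura matrix can be built directly from the purine/pyrimidine classification. The following is a minimal sketch, assuming the state order (a, c, g, t); the function name and the parameter values are illustrative. It also checks that the uniform distribution is left unchanged by one step of the chain.

```python
BASES = "acgt"
PURINES = {"a", "g"}  # c and t are the pyrimidines


def kimura_matrix(alpha, beta):
    """Kimura transition matrix: alpha for transitions, beta for transversions."""
    if alpha < 0 or beta < 0 or alpha + 2 * beta >= 1:
        raise ValueError("need alpha, beta >= 0 and alpha + 2*beta < 1")

    def entry(x, y):
        if x == y:
            return 1 - alpha - 2 * beta
        # Same chemical class (both purines or both pyrimidines): transition.
        same_class = (x in PURINES) == (y in PURINES)
        return alpha if same_class else beta

    return [[entry(x, y) for y in BASES] for x in BASES]


# Illustrative parameter values only.
P = kimura_matrix(0.02, 0.005)
pi = [0.25, 0.25, 0.25, 0.25]
# One step of the chain applied to pi: (pi P)_j = sum_i pi_i * p_ij.
pi_next = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
print(pi_next)
```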

8 There are even more realistic Markov models for DNA substitution (for example Hasegawa, Kishino, Yano, and many others). Most models assume that sites evolve independently (which is not entirely realistic). Some models allow different sites to evolve at different rates.
Why not use more realistic models? Because they are very difficult to handle! The more complicated the model, the harder it is to compute the probabilities of interest, and the more parameters we need to estimate. Simpler models seem to give sensible results. As always in mathematical modeling, we need a balance between realism and mathematical tractability.

9 Evolutionary models for proteins? Similar, but here it is much more important to account for the wide range of different transition probabilities associated with amino acid substitutions. See the construction of substitution matrices.

10 Scoring systems: In bioinformatics we are very interested in comparing DNA or protein sequences, for example to infer molecular function by finding similarities to a sequence with a known function. For this purpose we need a good alignment of the given sequences, and to find one we need a measure for judging the quality of a given alignment against other possible alignments. This is a scoring system.

11 Additive scoring system: Look at each position of a given alignment and assign a score for the quality of the match at that position. Ignore gaps for now! The total (or cumulative) score is obtained by adding the scores for the individual positions. Example: Two DNA sequences. Score +1 for a match, −1 for a mismatch.

a a g t t t c t t g
a a a c t c c c t g

Cumulative score: 6 − 4 = 2.

12 More realistic system? Score +1 for a match, −1/2 for a transition and −1 for a transversion. Cumulative score in this case: 6 − 2 = 4 (all four mismatches in the example are transitions).
Scoring matrices: The scores for individual positions can be displayed in a substitution matrix (also called a scoring matrix). This is usually a symmetric 4 × 4 (DNA) or 20 × 20 (protein) matrix which has as entry (i, j) the score that we assign if the nucleotides (or amino acids) i and j are aligned.

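13 For example, with the scoring system that distinguishes transitions from transversions (+1 for a match, −1/2 for a transition, −1 for a transversion), we get the scoring matrix S given by

\[
S = \begin{pmatrix}
s_{aa} & s_{ac} & s_{ag} & s_{at} \\
s_{ca} & s_{cc} & s_{cg} & s_{ct} \\
s_{ga} & s_{gc} & s_{gg} & s_{gt} \\
s_{ta} & s_{tc} & s_{tg} & s_{tt}
\end{pmatrix}
=
\begin{pmatrix}
1 & -1 & -1/2 & -1 \\
-1 & 1 & -1 & -1/2 \\
-1/2 & -1 & 1 & -1 \\
-1 & -1/2 & -1 & 1
\end{pmatrix}.
\]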
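The two DNA scoring systems can be applied to the example alignment in a few lines. The following is a minimal sketch; the function names are illustrative, and gaps are ignored as in the slides.

```python
PURINES = {"a", "g"}  # c and t are the pyrimidines


def simple_score(x, y):
    """+1 for a match, -1 for any mismatch."""
    return 1 if x == y else -1


def transition_aware_score(x, y):
    """+1 for a match, -1/2 for a transition, -1 for a transversion."""
    if x == y:
        return 1.0
    if (x in PURINES) == (y in PURINES):
        return -0.5  # purine/purine or pyrimidine/pyrimidine
    return -1.0


def alignment_score(s, t, score):
    """Additive score of an ungapped alignment of two equal-length sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(score(x, y) for x, y in zip(s, t))


first = "aagtttcttg"
second = "aaactccctg"
print(alignment_score(first, second, simple_score))            # 2
print(alignment_score(first, second, transition_aware_score))  # 4.0
```

Passing the per-position score as a function keeps the additive part separate from the choice of substitution matrix.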

14 A biologically sensible scoring matrix? For DNA sequences: simple scoring matrices (like the one presented) are often effective. Usually no need to worry! For protein sequences: some substitutions are clearly more likely to occur than others (presumably due to chemical properties of the amino acids), for example isoleucine for valine, or serine for threonine. These are conservative substitutions. We get better alignments if we take this into account, so we use scoring matrices that are derived by statistical analysis of protein data.

15 A biologically sensible scoring matrix for proteins: Identical amino acids should be given a greater score than any substitution. Conservative substitutions should be given a greater score than non-conservative ones. Different sets of values may be desired for comparing very similar sequences (e.g. homologies in mouse and rat) as opposed to highly divergent sequences (e.g. homologies in mouse and yeast). That is, we usually want our scoring matrix to take into account the evolutionary distance between our sequences.

16 Two frequently used approaches:
1. The PAM family of substitution matrices. Uses Markov chains and phylogenetic trees (to fit an evolutionary model) and log-likelihood ratios (for obtaining a scoring matrix from an estimated transition matrix).
2. The BLOSUM family of substitution matrices. Uses log-likelihood ratios (for obtaining a scoring matrix from a matrix of estimated substitution probabilities).

17 The PAM family (Dayhoff, Schwartz and Orcutt, 1978). PAM = Point (or Percentage) Accepted Mutations. Accepted point mutation: a substitution of one amino acid for another that is accepted by evolution. That is, within some given species, the mutation has (over time) spread to essentially the entire species. Two types of matrices are involved: the PAM Markov transition matrix (the estimated transition matrix for the underlying model) and the PAM substitution matrix (giving us our scores).

18 The underlying model: Each site in the sequence evolves according to a Markov chain, independently of the other sites. All Markov chains have the same 20 × 20 transition matrix P. P is estimated from protein sequence data. A PAM1 transition matrix is the Markov chain transition matrix applying for a time over which we expect 1% of the amino acids to undergo accepted point mutations.

19 To estimate the transition matrix:
Find reliable data and align protein sequences that are at least 85% identical;
Reconstruct phylogenetic trees and infer ancestral sequences;
Count the amino acid replacements that occurred along the trees (count mutations accepted by natural selection);
Use these counts to estimate probabilities of replacements.

20 Dayhoff et al. (1978) used ungapped multiple alignments of well-conserved regions from closely related proteins (71 groups of proteins, with 1572 changes in total). In any block, two sequences did not differ by more than 15%. This keeps the number of sites that have undergone several changes low. These aligned regions are used to find the underlying evolutionary tree(s). We want the most parsimonious trees: those with the fewest substitutions. There may be more than one!

21 Why do we use trees? To avoid overcounting. Our count might be biased by closely related sequences that are overrepresented in our database. Trees give sequences that are grouped in the right way. Very similar sequences tend to succeed each other in the tree. We mainly have transitions between these sequences, and only a few transitions to other, more different, sequences, so the corresponding substitutions are not given unnatural importance.

22 Example: Suppose we are given a block of three sequences: AA, AB and BB. There are 5 most parsimonious trees which have these three sequences as their leaves. [Figure: the five most parsimonious trees] We then count the number of amino acid substitutions of each type that occur in the trees...

23 A is substituted for B (or vice versa) twice in each of the five trees. This is an A↔B total (and B↔A total) of 10. Divide by the number of trees (5) to get the count 2. A is aligned with A a total of 15 times over the five trees. Each A↔A alignment gives a count of 2, so we get 30. Divide by the number of trees to get 6. A similar calculation for B↔B also gives 6. We can form a matrix:

\[
\begin{pmatrix}
A \leftrightarrow A & A \leftrightarrow B \\
B \leftrightarrow A & B \leftrightarrow B
\end{pmatrix}
=
\begin{pmatrix}
6 & 2 \\
2 & 6
\end{pmatrix}.
\]

24 Suppose the amino acids are numbered 1 to 20. Just as in the example above, we can form a count matrix, where $A_{j,k}$ is the j↔k count:

\[
A = \begin{pmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,20} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,20} \\
\vdots & \vdots & & \vdots \\
A_{20,1} & A_{20,2} & \cdots & A_{20,20}
\end{pmatrix}.
\]

In general there will be more than one block: add the counts from each block to get the final count. The count matrix A is used to estimate the transition probabilities...

25 For any pair (j, k) define

\[
a_{j,k} = \frac{A_{j,k}}{\sum_{m=1}^{20} A_{j,m}}.
\]

These are estimated probabilities. To get the transition matrix $P = (p_{jk})_{20 \times 20}$, we scale the $a_{j,k}$ in a certain way. Let c be a positive scaling constant and set

\[
p_{jk} = c\, a_{j,k} \quad (j \neq k), \qquad p_{jj} = 1 - \sum_{k \neq j} c\, a_{j,k}.
\]

It follows that $\sum_k p_{jk} = 1$. We need to choose c small enough that $p_{jj} \geq 0$ for all j.

26 Why the scaling factor c? To account for the evolutionary distance. We choose a value of c which gives a transition matrix useful for short evolutionary periods. More precisely, choose a value c such that 1% of the amino acids are expected to undergo accepted point mutations during one time unit. This is an evolutionary distance of 1 PAM.

27 Consider a particular site in the sequence. Recall that we label the amino acids 1, ..., 20. Let $Z_n$ be the amino acid present at the site at time n. The probability that the site changes in one time step is

\[
\begin{aligned}
P(Z_1 \neq Z_0) &= \sum_{j=1}^{20} P(Z_0 = j,\ Z_1 \neq j) \\
&= \sum_{j=1}^{20} P(Z_1 \neq j \mid Z_0 = j)\, P(Z_0 = j) \\
&= \sum_{j=1}^{20} P(Z_1 \neq j \mid Z_0 = j)\, q_j,
\end{aligned}
\]

where $q_j$ is the observed frequency of amino acid j in the original block of aligned proteins.

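28 We want the probability of a change to be 0.01 (an average change of 1%):

\[
\begin{aligned}
0.01 &= \sum_{j=1}^{20} P(Z_1 \neq j \mid Z_0 = j)\, q_j \\
&= \sum_{j=1}^{20} \sum_{k \neq j} P(Z_1 = k \mid Z_0 = j)\, q_j \\
&= \sum_{j=1}^{20} \sum_{k \neq j} p_{jk}\, q_j \\
&= \sum_{j=1}^{20} \sum_{k \neq j} c\, a_{j,k}\, q_j \\
&= c \sum_{j=1}^{20} \sum_{k \neq j} q_j\, a_{j,k}.
\end{aligned}
\]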

29 So, we take

\[
c = \frac{0.01}{\sum_{j=1}^{20} \sum_{k \neq j} q_j\, a_{j,k}}.
\]

How can we turn our transition matrix into a scoring matrix?
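The construction of the PAM1 transition matrix from a count matrix can be sketched as follows. This is a minimal illustration on a made-up three-letter alphabet rather than the 20 amino acids. The counts, the frequencies and the function name are hypothetical and are not Dayhoff's data.

```python
def pam1_transition_matrix(counts, freqs, target=0.01):
    """Scale row-normalized counts so that the expected change per step is `target`.

    counts[j][k] plays the role of A_{j,k}; freqs[j] plays the role of q_j.
    Returns the transition matrix P and the scaling constant c.
    """
    size = len(counts)
    # a_{j,k} = A_{j,k} / sum_m A_{j,m}
    a = [[counts[j][k] / sum(counts[j]) for k in range(size)]
         for j in range(size)]
    # c = target / sum_j sum_{k != j} q_j a_{j,k}
    off_diagonal = sum(freqs[j] * a[j][k]
                       for j in range(size) for k in range(size) if k != j)
    c = target / off_diagonal
    # p_{jk} = c a_{j,k} for j != k; the diagonal makes each row sum to 1.
    p = [[c * a[j][k] if k != j else 0.0 for k in range(size)]
         for j in range(size)]
    for j in range(size):
        p[j][j] = 1.0 - sum(p[j])
    return p, c


# Made-up counts and frequencies, for illustration only.
counts = [[60, 3, 1],
          [3, 40, 2],
          [1, 2, 30]]
freqs = [0.5, 0.3, 0.2]

p, c = pam1_transition_matrix(counts, freqs)
# Expected proportion of sites that change in one step: sum_j q_j (1 - p_jj).
change = sum(freqs[j] * (1 - p[j][j]) for j in range(3))
print(c, change)
```

For real data one would also check that c is small enough to keep every diagonal entry non-negative, as required on slide 25.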

30 Consider the two protein sequences $s = a_1 a_2 \cdots a_n$ and $s' = b_1 b_2 \cdots b_n$ (with an evolutionary distance of 1 PAM). The score for aligning s with s' is found by comparing the null and alternative hypotheses
$H_0$: s and s' are not evolutionarily related (a chance alignment).
$H_1$: s and s' are evolutionarily related (s' depends on s via the Markov model).

31 Under $H_0$: We have a chance alignment. That is, all sites in both sequences are randomly generated, and all the sites are independent of each other. Suppose amino acid j appears with probability $q_j$. The probability of getting this chance alignment is

\[
P_{H_0}(\text{the alignment}) = \prod_{i=1}^{n} q_{a_i} \prod_{i=1}^{n} q_{b_i} = \prod_{i=1}^{n} \left( q_{a_i} q_{b_i} \right).
\]

32 Under $H_1$: The sites in the two sequences are dependent, according to the Markov model described earlier. For example, P(site 3 changes from A to B in 1 time step) = $p_{AB}$, the one-step transition probability. Then $P_{H_1}$(align A and B at site 3) = $q_A\, p_{AB}$. Since all sites behave independently:

\[
P_{H_1}(\text{the alignment}) = \prod_{i=1}^{n} q_{a_i}\, p_{a_i b_i}.
\]

33 We want our score to reflect the chance that with s and s' we have aligned evolutionarily related sequences. That is, we want the score to be high if the chance is high that we have aligned related sequences. A natural choice for the score is a comparison of the probability of the alignment under $H_0$ and $H_1$. The likelihood ratio:

\[
\text{Score} = \frac{P_{H_1}(\text{the alignment})}{P_{H_0}(\text{the alignment})}
= \prod_{i=1}^{n} \frac{q_{a_i}\, p_{a_i b_i}}{q_{a_i}\, q_{b_i}}
= \prod_{i=1}^{n} \frac{p_{a_i b_i}}{q_{b_i}}.
\]

34 Equivalently (better for theoretical reasons), use the log-likelihood ratio:

\[
\text{Score} = \log \left( \frac{P_{H_1}(\text{the alignment})}{P_{H_0}(\text{the alignment})} \right)
= \log \prod_{i=1}^{n} \frac{p_{a_i b_i}}{q_{b_i}}
= \sum_{i=1}^{n} \log \left( \frac{p_{a_i b_i}}{q_{b_i}} \right).
\]

The entry at position (a, b) in the PAM substitution matrix is then

\[
S_{a,b} = \log \left( \frac{p_{ab}}{q_b} \right)
\]

(or this value rounded to the nearest integer for convenience).

35 Using the logarithm, we have obtained our additive scoring system. With aligned sequences $s = a_1 a_2 \cdots a_n$ and $s' = b_1 b_2 \cdots b_n$ we get

\[
S = \text{Total Score} = \sum_{i=1}^{n} S_{a_i, b_i}.
\]

Adding the individual scores is equivalent to multiplying the probabilities, thanks to the logarithm:

\[
S = \log \left( \frac{P_{H_1}(\text{the alignment})}{P_{H_0}(\text{the alignment})} \right)
= \sum_{i=1}^{n} \log \left( \frac{p_{a_i b_i}}{q_{b_i}} \right)
= \sum_{i=1}^{n} S_{a_i, b_i}.
\]
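The equivalence between adding log-odds scores and taking the logarithm of the likelihood ratio can be checked on a toy example. The following is a minimal sketch on a hypothetical two-letter alphabet {A, B}; the frequencies q, the transition probabilities p and the two sequences are made up for illustration.

```python
import math

# Made-up background frequencies q_a and one-step transition probabilities p_ab.
q = {"A": 0.6, "B": 0.4}
p = {("A", "A"): 1 - 1 / 120, ("A", "B"): 1 / 120,
     ("B", "A"): 1 / 80, ("B", "B"): 1 - 1 / 80}


def score_matrix(p, q):
    """Log-odds scores S_{a,b} = log(p_ab / q_b), natural logarithm."""
    return {(a, b): math.log(p[a, b] / q[b]) for (a, b) in p}


def total_score(s, t, scores):
    """Additive score of an ungapped alignment."""
    return sum(scores[a, b] for a, b in zip(s, t))


def likelihood_ratio(s, t, p, q):
    """P_H1(alignment) / P_H0(alignment) under the site-independence model."""
    h1 = math.prod(q[a] * p[a, b] for a, b in zip(s, t))
    h0 = math.prod(q[a] * q[b] for a, b in zip(s, t))
    return h1 / h0


scores = score_matrix(p, q)
s, t = "AABAB", "AABBB"
total = total_score(s, t, scores)
ratio = likelihood_ratio(s, t, p, q)
print(total, math.log(ratio))  # the two numbers should agree
```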

36 Note that

\[
S_{a,b} < 0
\iff \frac{p_{ab}}{q_b} < 1
\iff \frac{q_a\, p_{ab}}{q_a\, q_b} < 1
\iff q_a\, p_{ab} < q_a\, q_b,
\]

that is, if we are more likely to see a and b aligned against each other in a random alignment than to see a and b aligned in a comparison of two related sequences (at PAM 1 distance). Otherwise,

\[
S_{a,b} = \log \left( \frac{p_{ab}}{q_b} \right) \geq 0.
\]

37 PAMn substitution matrices? For sequences having an evolutionary distance of n PAM units. This does not mean that we expect n% of the amino acids to differ: substitutions can occur at the same site many times! Let P be the 1 PAM transition matrix. As always, the n-step transition probabilities $p^{(n)}_{ab}$ are the entries of the matrix $P^n$. The corresponding scores are

\[
S^{(n)}_{a,b} = \log \left( \frac{p^{(n)}_{ab}}{q_b} \right).
\]
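The step from PAM1 to PAMn is a matrix power. The following is a minimal sketch on a hypothetical two-letter alphabet; the matrix P1 and the frequencies Q are made up so that exactly 1% of sites are expected to change in one step. It also illustrates that the expected proportion of differing sites at distance n PAM is below n%.

```python
import math

# Made-up toy model: two letters with background frequencies Q and a
# one-step (1 PAM) transition matrix P1.
Q = [0.6, 0.4]
P1 = [[1 - 1 / 120, 1 / 120],
      [1 / 80, 1 - 1 / 80]]


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(p, n):
    """n-th power of a square matrix by repeated multiplication (n >= 1)."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result


def pam_scores(p1, q, n):
    """Scores S^(n)_{a,b} = log(p^(n)_ab / q_b) from the n-step matrix."""
    pn = mat_pow(p1, n)
    size = len(q)
    return [[math.log(pn[a][b] / q[b]) for b in range(size)]
            for a in range(size)]


def expected_fraction_changed(p1, q, n):
    """Probability that a site shows a different letter after n steps."""
    pn = mat_pow(p1, n)
    return sum(q[a] * (1 - pn[a][a]) for a in range(len(q)))


frac_1 = expected_fraction_changed(P1, Q, 1)
frac_100 = expected_fraction_changed(P1, Q, 100)
scores_100 = pam_scores(P1, Q, 100)
print(frac_1, frac_100)
```

With only two letters the observed difference saturates quickly because back-substitutions are frequent; with 20 amino acids the effect is weaker but of the same kind.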

38 The BLOSUM family (Henikoff and Henikoff, 1992). BLOSUM = BLOcks SUbstitution Matrices. Again the scores will be logarithms of likelihood ratios, but this time there are no evolutionary models, and so no Markov chains and no trees. The likelihoods are obtained by statistical analysis of blocks of aligned sequences.

39 Blocks of aligned sequences: The blocks needed stem from an ungapped multiple alignment of a relatively highly conserved region of a family of proteins. This is a different kind of data to that used in the PAM family. H&H's data was far more extensive: several hundred groups of proteins, at least 2369 occurrences of any particular substitution. Difference in concept: Dayhoff et al. used data from closely related proteins and extrapolated (PAM1 to PAMn). H&H directly used protein sequences regardless of their evolutionary distance.

40 How are the likelihoods obtained? We count
the proportion of times $p_a$ that the amino acid a occurs somewhere in the blocks;
the proportion of times $p_{ab}$ that the amino acid pair (a, b) occurs in the same column of any block.
NB: a and b are not necessarily distinct. There are $\binom{20}{2} + 20 = 210$ pairs of amino acids, and $N \binom{m}{2}$ pairs that have to be taken into account in all the blocks, if each block has the same number of rows. N is the number of columns in all blocks together, and m is the number of rows in each block.

41 $p_{ab}$ is the likelihood (or estimated probability) for the substitution a ↔ b under the assumption that the sequences are related. $p_a p_b$ is the likelihood (or estimated probability) for the substitution a → b (also for b → a) under the assumption that the sequences are not related. Thus, for a ≠ b, the likelihood for the substitution a ↔ b is $2 p_a p_b$. A scoring matrix is obtained using the same ideas as for the PAM matrix. Set the entry at position (a, b) to be

\[
S_{a,b} =
\begin{cases}
2 \log_2 \left( \dfrac{p_{ab}}{2\, p_a\, p_b} \right) & \text{if } a \neq b, \\[2ex]
2 \log_2 \left( \dfrac{p_{aa}}{p_a^2} \right) & \text{if } a = b.
\end{cases}
\]
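The counting and scoring steps can be sketched on a tiny made-up block. The following is a minimal illustration, assuming unordered pairs within each column and taking $p_a$ as the plain proportion of residues equal to a; the function names and the toy block are hypothetical. Pairs that never occur in the data receive no score here, since their log-odds would be undefined.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(blocks):
    """Count unordered amino acid pairs occurring in the same column of a block."""
    counts = Counter()
    for block in blocks:
        for column in zip(*block):
            for x, y in combinations(column, 2):
                counts[tuple(sorted((x, y)))] += 1
    return counts


def blosum_scores(blocks):
    """Scores 2*log2(p_ab / expected) for every pair observed in the blocks."""
    counts = pair_counts(blocks)
    total_pairs = sum(counts.values())
    p_pair = {pair: n / total_pairs for pair, n in counts.items()}
    # p_a: proportion of all residues in the blocks that equal a.
    residues = Counter(ch for block in blocks for row in block for ch in row)
    total_residues = sum(residues.values())
    p_single = {a: n / total_residues for a, n in residues.items()}
    scores = {}
    for (a, b), prob in p_pair.items():
        if a == b:
            expected = p_single[a] ** 2
        else:
            expected = 2 * p_single[a] * p_single[b]
        scores[a, b] = 2 * math.log2(prob / expected)
    return scores


# One made-up block with m = 3 rows and N = 3 columns.
toy_blocks = [["AAB", "ABB", "AAB"]]
counts = pair_counts(toy_blocks)
scores = blosum_scores(toy_blocks)
print(sum(counts.values()), math.comb(3, 2) * 3)  # N * C(m, 2) pairs in total
print(scores)
```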

42 The scaling $2 \log_2$ was used by H&H; using log would give essentially the same score. This is not yet the BLOSUM we are looking for... To obtain the initial blocks, a multiple alignment was found. A substitution matrix is needed for that! Solution...?

43 The circularity problem. Solution: iteration. H&H used a simple unit matrix for this first alignment (1 for a match, 0 for a mismatch). With the BLOSUM matrix they obtained by the above procedure, they constructed a second set of blocks and a second BLOSUM matrix. Then, with this second matrix, a third matrix was constructed. This is the matrix that is recommended to be used. This gives a BLOSUM100 matrix, provided that we have eliminated all identical copies of sequences from the original blocks. The BLOSUM100 is not very useful!

44 The overcounting problem. Solution: clustering. The problem of overcounting is solved by clustering those sequences in each block that are sufficiently close. That is, we combine them in a skillful way and regard them as a single sequence. The result is a BLOSUMx matrix, where x determines what we mean by sufficiently close: we cluster the sequences that have x% (or more) in common. A commonly used intermediate choice is BLOSUM62 (the default when you do a BLAST search on NCBI).
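The clustering step can be sketched as follows. This is a minimal illustration, assuming single-linkage clustering (a sequence joins a cluster as soon as it is at least x% identical to any member); that linkage rule, the function names and the toy block are assumptions of this sketch. It only groups the sequences and does not show how a cluster is then combined into a single weighted sequence.

```python
def percent_identity(s, t):
    """Percentage of aligned positions at which two equal-length sequences agree."""
    matches = sum(x == y for x, y in zip(s, t))
    return 100.0 * matches / len(s)


def cluster_sequences(sequences, threshold):
    """Group sequences that are linked by at least `threshold` percent identity."""
    clusters = []
    for seq in sequences:
        merged = [seq]
        remaining = []
        for cluster in clusters:
            # The new sequence links to a cluster if it is close to any member.
            if any(percent_identity(seq, member) >= threshold
                   for member in cluster):
                merged.extend(cluster)
            else:
                remaining.append(cluster)
        clusters = remaining + [merged]
    return clusters


# Made-up block of four aligned segments over a two-letter alphabet.
block = ["AAAAAAAAAA",
         "AAAAAAAAAB",
         "AAAAAAAABB",
         "BBBBBBBBBB"]

print(cluster_sequences(block, 90))   # the first three segments form one cluster
print(cluster_sequences(block, 100))  # every segment is its own cluster
```

Lowering the threshold merges more sequences, so fewer near-identical pairs are counted and the resulting matrix reflects more distant relationships.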

45 PAM vs. BLOSUM
The circularity problem. PAM: not addressed. BLOSUM: iterative procedure.
The overcounting problem. PAM: inference of phylogenetic trees for each block; substitutions are only counted along the edges of the trees. BLOSUM: clustering by the x% rule in each block.
The evolutionary distance problem. PAM: Markov chain theory; distances are accounted for by n-step transition matrices for different n (higher distance = higher n). BLOSUM: clustering by the x% rule in each block (higher distance = lower x).

46 Note that the numbers n and x in PAMn and BLOSUMx play opposite roles. Higher values of n and lower values of x both correspond to longer evolutionary distances. The n counts the time steps in the Markov chain used for the evolutionary model. The x tells us up to what percentage of similarity two sequences in a block will be seen as different.

47 Advantages of BLOSUM: Simpler model. Observation-based and mostly independent of other models and concepts (e.g. Markov chains). BLOSUM matrices seem to be better than PAM matrices at detecting biological relationships (even if the same amount of data is used).

48 Advantages of PAM: Gives an explicit evolutionary model as a by-product. Helps to give a better understanding of biological relationships.
