Evolutionary Change in Nucleotide Sequences. Lecture 3


So far, we have described the evolutionary process as a series of gene substitutions in which new alleles, each arising as a mutation in a single individual, progressively increase their frequency and ultimately become fixed in the population.

We may look at the process from a different point of view. An allele that becomes fixed is different in its sequence from the allele that it replaces. That is, the substitution of a new allele for an old one is the substitution of a new sequence for a previous sequence. 3

If we use a time scale in which one time unit is larger than the time of fixation, then the DNA sequence at any given locus will appear to change with time.

[Figure: the sequence at one locus at successive time points; each sequence differs from the previous one by a single substitution]

actgggggtaaactatcggtatagatcataa
actgggggttaactatcggtatagatcataa
actgggggtgaactatcggtatagatcataa
actgggggtgaactatcggtacagatcataa

To study the dynamics of nucleotide substitution, we must make several assumptions regarding the probability of substitution of a nucleotide by another. 5

Jukes & Cantor's one-parameter model

Assumption: Substitutions occur with equal probabilities among the four nucleotide types. 7

If the nucleotide residing at a certain site in a DNA sequence is A at time 0, what is the probability, $P_{A(t)}$, that this site will be occupied by A at time t?

Since we start with A, $P_{A(0)} = 1$. At time 1, the probability of still having A at this site is $P_{A(1)} = 1 - 3\alpha$, where 3α is the probability of A changing to T, C, or G, and 1 - 3α is the probability that A has remained unchanged.

To derive the probability of having A at time 2, we consider two possible scenarios: 10

1. The nucleotide has remained unchanged from time 0 to time 2. 11

2. The nucleotide has changed to T, C, or G at time 1, but has subsequently reverted to A at time 2. 12

$$P_{A(2)} = (1 - 3\alpha)\,P_{A(1)} + \alpha\,(1 - P_{A(1)})$$

The following equation applies to any t and t + 1:

$$P_{A(t+1)} = (1 - 3\alpha)\,P_{A(t)} + \alpha\,(1 - P_{A(t)})$$

We can rewrite the equation in terms of the amount of change in $P_{A(t)}$ per unit time:

$$\Delta P_{A(t)} = P_{A(t+1)} - P_{A(t)} = -3\alpha P_{A(t)} + \alpha\,(1 - P_{A(t)}) = -4\alpha P_{A(t)} + \alpha$$

We approximate the discrete-time process by a continuous-time model, by regarding $\Delta P_{A(t)}$ as the rate of change at time t:

$$\frac{dP_{A(t)}}{dt} = -4\alpha P_{A(t)} + \alpha$$

The solution is:

$$P_{A(t)} = \frac{1}{4} + \left(P_{A(0)} - \frac{1}{4}\right) e^{-4\alpha t}$$

Starting from $P_{A(t)} = \frac{1}{4} + \left(P_{A(0)} - \frac{1}{4}\right) e^{-4\alpha t}$: if we start with A, the probability that the site has A at time 0 is 1. Thus, $P_{A(0)} = 1$, and consequently

$$P_{A(t)} = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$$

If we start with non-A, the probability that the site has A at time 0 is 0. Thus, $P_{A(0)} = 0$, and consequently

$$P_{A(t)} = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$$

In the Jukes and Cantor model, the probability of each of the four nucleotides at equilibrium (t = ∞) is 1/4: both expressions above converge to 1/4 as t increases.
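As a rough numerical check of the derivation, the sketch below iterates the discrete recurrence P_A(t+1) = (1 - 3α)P_A(t) + α(1 - P_A(t)) and compares it with the continuous-time solution 1/4 + (P_A(0) - 1/4)e^(-4αt). The function names and the value of `alpha` are illustrative assumptions, not part of the lecture.

```python
import math


def p_a_recurrence(alpha, steps, p0=1.0):
    """Iterate P_A(t+1) = (1 - 3*alpha)*P_A(t) + alpha*(1 - P_A(t))."""
    p = p0
    for _ in range(steps):
        p = (1 - 3 * alpha) * p + alpha * (1 - p)
    return p


def p_a_closed_form(alpha, t, p0=1.0):
    """Continuous-time solution: 1/4 + (P_A(0) - 1/4) * exp(-4*alpha*t)."""
    return 0.25 + (p0 - 0.25) * math.exp(-4 * alpha * t)


alpha = 1e-3  # hypothetical substitution probability per time unit
for t in (0, 10, 100, 1000, 10000):
    print(t, round(p_a_recurrence(alpha, t), 4), round(p_a_closed_form(alpha, t), 4))
```

For small α the two columns should agree closely, and both approach 1/4 for large t regardless of whether the site starts as A (`p0=1.0`) or non-A (`p0=0.0`).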

So far, we treated $P_{A(t)}$ as a probability. However, $P_{A(t)}$ can also be interpreted as the frequency of A in a DNA sequence at time t. For example, if we start with a sequence made of adenines only, then $P_{A(0)} = 1$, and $P_{A(t)}$ is the expected frequency of A in the sequence at time t. The expected frequency of A in the sequence at equilibrium will be 1/4, and so will the expected frequencies of T, C, and G.

After reaching equilibrium no further change in the nucleotide frequencies is expected to occur. However, the actual frequencies of the nucleotides will remain unchanged only in DNA sequences of infinite length. In practice, fluctuations in nucleotide frequencies are likely to occur. 22

NUMBER OF NUCLEOTIDE SUBSTITUTIONS BETWEEN TWO DNA SEQUENCES 26

After two nucleotide sequences diverge from each other, each of them will start accumulating nucleotide substitutions. If two sequences of length N differ from each other at n sites, then the proportion of differences, n/N, is referred to as the degree of divergence or Hamming distance. Degrees of divergence are usually expressed as percentages (n/N × 100%).
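A minimal sketch of the degree of divergence n/N for two ungapped sequences of equal length. The function name is an assumption made for illustration; the two example sequences are taken from the time-series example earlier in the lecture.

```python
def degree_of_divergence(seq1, seq2):
    """Return n/N: the proportion of sites at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    n = sum(a != b for a, b in zip(seq1, seq2))  # number of differing sites
    return n / len(seq1)


first = "actgggggtaaactatcggtatagatcataa"
last = "actgggggtgaactatcggtacagatcataa"
print(f"{100 * degree_of_divergence(first, last):.1f}%")
```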

The observed number of differences is likely to be smaller than the actual number of substitutions due to multiple hits at the same site.

13 mutations = 3 differences 30

Number of substitutions between two noncoding (NOT protein coding) sequences 32

The one-parameter model

In this model, it is sufficient to consider only $I_{(t)}$, which is the probability that the nucleotide at a given site at time t is the same in both sequences.

$$I_{(t)} = \frac{1}{4} + \frac{3}{4} e^{-8\alpha t}$$

where $I_{(t)}$ is the proportion of identical nucleotides between two sequences that diverged t time units ago. Compare this with the single-sequence result, $P_{A(t)} = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$: the exponent is doubled because substitutions accumulate along both lineages.

The probability that the two sequences are different at a site at time t is $p = 1 - I_{(t)}$:

$$p = \frac{3}{4}\left(1 - e^{-8\alpha t}\right)$$

t is usually not known and, thus, we cannot estimate α. Instead, we compute K, which is the number of substitutions per site since the time of divergence between the two sequences.

L = number of sites compared between the two sequences.
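The step from p to K is not spelled out in the surviving slide text. Under the one-parameter model each lineage accumulates 3αt substitutions per site, so K = 2(3αt) = 6αt and 8αt = 4K/3; substituting into p = (3/4)(1 - e^(-8αt)) gives the standard Jukes-Cantor estimate K = -(3/4) ln(1 - 4p/3), with sampling variance approximately p(1 - p) / [L(1 - 4p/3)^2]. The sketch below implements these two formulas; the function name is an assumption.

```python
import math


def jukes_cantor_distance(p, L):
    """Return (K, V(K)) under the one-parameter model.

    p: observed proportion of differing sites; L: number of sites compared.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 3/4 for the correction to be defined")
    k = -0.75 * math.log(1 - 4 * p / 3)
    variance = p * (1 - p) / (L * (1 - 4 * p / 3) ** 2)
    return k, variance


k, variance = jukes_cantor_distance(0.3, 100)
print(round(k, 4), round(math.sqrt(variance), 4))
```

Because of multiple hits, the estimated K is always at least as large as the observed p, and the gap widens as p approaches 3/4.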

Jukes & Cantor's one-parameter model

Kimura's two-parameter model

Assumptions: The rate of transitional substitution at each nucleotide site is α per unit time. The rate of each type of transversional substitution is β per unit time. 41

α β 5-10 42

If the nucleotide residing at a certain site in a DNA sequence is A at time 0, what is the probability, $P_{AA(t)}$, that this site will be occupied by A at time t?

After one time unit the probability of A changing into G is α, the probability of A changing into C is β, and the probability of A changing into T is β. Thus, the probability of A remaining unchanged after one time unit is:

$$P_{AA(1)} = 1 - \alpha - 2\beta$$

To derive the probability of having A at time 2, we consider four possible scenarios: 45

1. A remained unchanged at t = 1 and t = 2 46

2. A changed into G at t = 1 and reverted by a transition to A at t = 2 47

3. A changed into C at t = 1 and reverted by a transversion to A at t = 2 48

4. A changed into T at t = 1 and reverted by a transversion to A at t = 2 49

$$P_{AA(2)} = (1 - \alpha - 2\beta)\,P_{AA(1)} + \beta P_{TA(1)} + \beta P_{CA(1)} + \alpha P_{GA(1)}$$

By extension we obtain the following recurrence equation for the general case:

$$P_{AA(t+1)} = (1 - \alpha - 2\beta)\,P_{AA(t)} + \beta P_{TA(t)} + \beta P_{CA(t)} + \alpha P_{GA(t)}$$

After rewriting this equation as the amount of change in $P_{AA(t)}$ per unit time, and after approximating the discrete-time model by the continuous-time model, we obtain the following differential equation:

$$\frac{dP_{AA(t)}}{dt} = -(\alpha + 2\beta)\,P_{AA(t)} + \beta P_{TA(t)} + \beta P_{CA(t)} + \alpha P_{GA(t)}$$

Similarly, we can obtain equations for $P_{TA(t)}$, $P_{CA(t)}$, and $P_{GA(t)}$, and from this set of four equations, we arrive at the following solution:

$$P_{AA(t)} = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} + \frac{1}{2} e^{-2(\alpha + \beta)t}$$

For comparison, the Jukes-Cantor solution is $P_{A(t)} = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$.

In the Jukes-Cantor model, $P_{AA(t)} = P_{GG(t)} = P_{CC(t)} = P_{TT(t)}$. Because of the symmetry of the substitution scheme, this equality also holds for Kimura's two-parameter model.

Three probabilities:

$X_{(t)}$ = the probability that a nucleotide at a site at time t is identical to that at time 0.

$$X_{(t)} = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} + \frac{1}{2} e^{-2(\alpha + \beta)t}$$

At equilibrium, the equation reduces to $X_{(\infty)} = 1/4$. Thus, as in the case of Jukes and Cantor's model, the equilibrium frequencies of the four nucleotides are 1/4.

$Y_{(t)}$ = the probability that the initial nucleotide and the nucleotide at time t differ from each other by a transition. Because of the symmetry of the substitution scheme, $Y_{(t)} = P_{AG(t)} = P_{GA(t)} = P_{TC(t)} = P_{CT(t)}$.

$$Y_{(t)} = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} - \frac{1}{2} e^{-2(\alpha + \beta)t}$$

$Z_{(t)}$ = the probability that the nucleotide at time t and the initial nucleotide differ by a specific type of transversion.

$$Z_{(t)} = \frac{1}{4} - \frac{1}{4} e^{-4\beta t}$$

Each nucleotide is subject to two types of transversion, but only one type of transition. Therefore, the probability that the initial nucleotide and the nucleotide at time t differ by a transversion is $2Z_{(t)}$, twice the probability that they differ by a specific type of transversion, and

$$X_{(t)} + Y_{(t)} + 2Z_{(t)} = 1$$
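The three probabilities can be checked numerically. The sketch below evaluates X(t) = 1/4 + (1/4)e^(-4βt) + (1/2)e^(-2(α+β)t), Y(t) = 1/4 + (1/4)e^(-4βt) - (1/2)e^(-2(α+β)t) and Z(t) = 1/4 - (1/4)e^(-4βt), and confirms that X + Y + 2Z = 1. The function name and the rate values are illustrative assumptions.

```python
import math


def k2p_probabilities(alpha, beta, t):
    """Return (X, Y, Z) under the two-parameter model at time t."""
    e_tv = math.exp(-4 * beta * t)             # term driven by transversions only
    e_ts = math.exp(-2 * (alpha + beta) * t)   # term driven by transitions and transversions
    x = 0.25 + 0.25 * e_tv + 0.5 * e_ts   # same nucleotide as at time 0
    y = 0.25 + 0.25 * e_tv - 0.5 * e_ts   # differs by a transition
    z = 0.25 - 0.25 * e_tv                # differs by one specific transversion
    return x, y, z


alpha, beta = 5e-3, 1e-3  # hypothetical rates with alpha > beta
for t in (0, 50, 500, 5000):
    x, y, z = k2p_probabilities(alpha, beta, t)
    print(t, round(x, 4), round(y, 4), round(z, 4), round(x + y + 2 * z, 4))
```

Setting `alpha == beta` makes X(t) collapse to the one-parameter result 1/4 + (3/4)e^(-4αt), which is a quick consistency check between the two models.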

Number of substitutions between two noncoding (NOT protein coding) sequences 59

The differences between two sequences are classified into transitions and transversions. P = proportion of transitional differences Q = proportion of transversional differences 60

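$$V(K) = \frac{1}{L}\left[ P\left(\frac{1}{1 - 2P - Q}\right)^{2} + Q\left(\frac{1}{2 - 4P - 2Q} + \frac{1}{2 - 4Q}\right)^{2} - \left(\frac{P}{1 - 2P - Q} + \frac{Q}{2 - 4P - 2Q} + \frac{Q}{2 - 4Q}\right)^{2} \right]$$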
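The formula for K itself did not survive in the slide text. The sketch below assumes the standard two-parameter estimate, K = (1/2) ln[1/(1 - 2P - Q)] + (1/4) ln[1/(1 - 2Q)], and computes the variance by writing a = 1/(1 - 2P - Q) and b = 1/(2 - 4P - 2Q) + 1/(2 - 4Q), so that V(K) = [a²P + b²Q - (aP + bQ)²]/L. The function name and the example values of P, Q and L are assumptions.

```python
import math


def kimura_2p_distance(P, Q, L):
    """Return (K, V(K)) from the proportions of transitional (P) and
    transversional (Q) differences over L compared sites."""
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("P and Q are too large for the correction to be defined")
    a = 1 / (1 - 2 * P - Q)
    b = 1 / (2 - 4 * P - 2 * Q) + 1 / (2 - 4 * Q)
    k = 0.5 * math.log(a) + 0.25 * math.log(1 / (1 - 2 * Q))
    variance = (a * a * P + b * b * Q - (a * P + b * Q) ** 2) / L
    return k, variance


k, variance = kimura_2p_distance(P=0.10, Q=0.05, L=500)
print(round(k, 4), round(math.sqrt(variance), 4))
```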

Numerical example (2P-model) 66

There are substitution schemes with more than two parameters!

Number of substitutions between two protein-coding genes 68

Computing the number of substitutions between two protein-coding sequences is more complicated, because a distinction should be made between synonymous and nonsynonymous substitutions.

Number of synonymous substitutions / Number of synonymous sites
Number of nonsynonymous substitutions / Number of nonsynonymous sites

Aims:
1. Compute two numerators: the numbers of synonymous and nonsynonymous substitutions.
2. Compute two denominators: the numbers of synonymous and nonsynonymous sites.

Difficulties with denominator: 1. The classification of a site changes with time. For example, the third position of CGG (Arg) is synonymous. However, if the first position changes to T, then the third position of the resulting codon, TGG (Trp), becomes nonsynonymous.

Difficulties with denominator: 2. Many sites are neither completely synonymous nor completely nonsynonymous. For example, a transition in the third position of GAT (Asp) will be synonymous, while a transversion to GAG or GAA will alter the amino acid. 74

Difficulties with numerator: 1. The classification of the change depends on the order in which the substitutions had occurred. 75

Difficulties with numerator: 1. When two homologous codons differ from each other by two substitutions or more, the order of the substitutions must be known in order to classify substitutions into synonymous and nonsynonymous. Example: CCC in sequence 1 and CAA in sequence 2.
Pathway I: CCC (Pro) → CCA (Pro) → CAA (Gln): 1 synonymous and 1 nonsynonymous
Pathway II: CCC (Pro) → CAC (His) → CAA (Gln): 2 nonsynonymous

Difficulties with numerator: 2. Transitions occur with different frequencies than transversions. 3. The type of substitution depends on the mutation: transitions result more frequently in synonymous substitutions than transversions.

Miyata & Yasunaga (1980) and Nei & Gojobori (1986) method 78

1. Classification of sites. Consider a particular position in a codon. Let i be the number of possible synonymous changes at this site. Then this site is counted as i/3 synonymous and (3 - i)/3 nonsynonymous.

In TTT (Phe), the first two positions are nonsynonymous, because no synonymous change can occur in them, and the third position is 1/3 synonymous and 2/3 nonsynonymous because one of the three possible changes is synonymous. 80
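The site-classification rule can be sketched as follows, assuming the standard genetic code. The table construction and the function name are illustrative; changes to a stop codon are simply counted as nonsynonymous here, which is a simplifying assumption rather than something stated in the lecture.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_sites(codon):
    """Sum i/3 over the three codon positions, where i is the number of
    single-base changes at that position that leave the amino acid unchanged."""
    total = 0.0
    for pos in range(3):
        i = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                i += 1
        total += i / 3
    return total


for codon in ("TTT", "CGG", "TGG"):
    s = synonymous_sites(codon)
    print(codon, CODON_TABLE[codon], round(s, 3), round(3 - s, 3))
```

The number of nonsynonymous sites in a codon is 3 minus the returned value; summing over all codons of a sequence gives the per-sequence counts used in the next step.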

2. Count the number of synonymous and nonsynonymous sites in each sequence and compute the averages between the two sequences. The average number of synonymous sites is $N_S$ and that of nonsynonymous sites is $N_A$.

3. Classify nucleotide differences into synonymous and nonsynonymous differences.

For two codons that differ by only one nucleotide, the difference is easily inferred. For example, the difference between the two codons GTC (Val) and GTT (Val) is synonymous, while the difference between the two codons GTC (Val) and GCC (Ala) is nonsynonymous. 83

For two codons that differ by two or more nucleotides, the estimation problem is more complicated, because we need to determine the order in which the substitutions occurred. 85

Pathway (1) requires one synonymous and one nonsynonymous change, whereas pathway (2) requires two nonsynonymous changes.

There are two approaches to deal with multiple substitutions at a codon: 87

The unweighted method: Average the numbers of the different types of substitutions for all the possible scenarios. For example, if we assume that the two pathways are equally likely, then the number of nonsynonymous differences is (1 + 2)/2 = 1.5, and the number of synonymous differences is (1 + 0)/2 = 0.5.

The weighted method: Employ a priori criteria to assign the probability of each pathway. For instance, if the weight of pathway 1 is 0.9, and the weight of pathway 2 is 0.1, then the number of nonsynonymous differences between the two codons is (0.9 × 1) + (0.1 × 2) = 1.1, and the number of synonymous differences is (0.9 × 1) + (0.1 × 0) = 0.9.
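A minimal sketch of the unweighted method: enumerate every order in which the differing positions could have changed, classify each step, and average over the pathways. The standard genetic code is assumed, the function name is illustrative, and pathways that pass through a stop codon are not excluded, which a real implementation would likely handle.

```python
from itertools import permutations, product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def unweighted_differences(codon1, codon2):
    """Return (synonymous, nonsynonymous) differences averaged over all pathways."""
    differing = [i for i in range(3) if codon1[i] != codon2[i]]
    if not differing:
        return 0.0, 0.0
    pathways = list(permutations(differing))  # every possible order of the changes
    syn = nonsyn = 0
    for order in pathways:
        current = codon1
        for pos in order:
            nxt = current[:pos] + codon2[pos] + current[pos + 1:]
            if CODON_TABLE[nxt] == CODON_TABLE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = nxt
    return syn / len(pathways), nonsyn / len(pathways)


print(unweighted_differences("CCC", "CAA"))
print(unweighted_differences("GTC", "GTT"))
```

The weighted method would replace the plain average with a weighted sum over the same pathways, using externally supplied pathway probabilities.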

4. The numbers of synonymous and nonsynonymous differences between the two protein-coding sequences are $M_S$ and $M_A$, respectively.

The number of synonymous differences per synonymous site is $p_S = M_S / N_S$.
The number of nonsynonymous differences per nonsynonymous site is $p_A = M_A / N_A$.

If we take into account the effect of multiple hits at the same site, we can make corrections by using Jukes and Cantor's formula: 93

$$K_S = -\frac{3}{4} \ln\left(1 - \frac{4 M_S}{3 N_S}\right)$$

$$K_A = -\frac{3}{4} \ln\left(1 - \frac{4 M_A}{3 N_A}\right)$$

Number of Amino-Acid Replacements between Two Proteins

The observed proportion of different amino acids between the two sequences (p) is p = n/L, where
n = number of amino acid differences between the two sequences
L = length of the aligned sequences.

The Poisson model is used to convert p into the number of amino-acid replacements between two sequences (d):

$$d = -\ln(1 - p)$$

The variance of d is estimated as

$$V(d) = \frac{p}{L\,(1 - p)}$$
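The Poisson correction is short enough to sketch directly. The function name and the example counts are assumptions for illustration.

```python
import math


def poisson_distance(n, L):
    """Return (d, V(d)) for n amino-acid differences over L aligned positions."""
    p = n / L
    if p >= 1:
        raise ValueError("the correction is undefined when every position differs")
    d = -math.log(1 - p)
    variance = p / (L * (1 - p))
    return d, variance


d, variance = poisson_distance(n=20, L=100)
print(round(d, 4), round(variance, 4))
```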

ALIGNMENT OF NUCLEOTIDE & AMINO-ACID SEQUENCES


Homology: The term was coined by Richard Owen in 1843. Definition: Similarity resulting from common ancestry. 103

Homology: A qualitative statement

Homology designates a relationship of common descent between entities. Two genes are either homologs or not. It doesn't make sense to say two genes are 43% homologous, just as it doesn't make sense to say Linda is 24% pregnant.

Homology By comparing homologous characters, we can reconstruct the evolutionary events that have led to the formation of the extant sequences from the common ancestor. 105

Homology

When dealing with sequences, we are interested in POSITIONAL HOMOLOGY. We identify positional homology by ALIGNMENT.

[Figure: an ancestral sequence and its two descendants]

Ancestor:      ACTGGGCCCAAATC
Descendant 1:  CTGGGCCCAGATC    (1 deletion, 1 substitution)
Descendant 2:  AACAGGGCCCAAATC  (1 insertion, 1 substitution)

Correct alignment:
--CTGGGCCCAGATC
AACAGGGCCCAAATC
  *.*******.***

Incorrect alignment:
CTGGGCCCAGATC--
AACAGGGCCCAAATC
....*..*..*..

[Figure: the same two sequences when the ancestor is unknown and the processes are unknown]

Correct alignment?
--CTGGGCCCAGATC
AACAGGGCCCAAATC

Incorrect alignment?
CTGGGCCCAGATC--
AACAGGGCCCAAATC

[Figure: true history versus the history inferred from an alignment]

Ancestor: ACCTGAATTTGCCC
Events labeled in the figure: T9, -A6, G5T, -A7, +ACA12, T8A, +G2
Descendants: ACCTTAATTGCACACC and AGCCTGATTGCCC

Alignment:
ACCTTAATTGCACACC
AGCCTGATTGCCC---

Events inferred from the alignment: C2G, T4C, A6G, A12C, -ACC14

Alignment: A hypothesis concerning positional homology among residues in a sequence. Positional homology = A pair of nucleotides from two aligned sequences that have descended from one nucleotide in the ancestor of the two sequences.

An alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:
(1) matches = the same nucleotide appears in both sequences.
(2) mismatches = different nucleotides are found in the two sequences.
(3) gaps = a base in one sequence and a null base in the other.

GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
***..*****  .*.******* *

Sequence alignment = The identification of the location of deletions or insertions that might have occurred in either of the two lineages since their divergence from a common ancestor.

Insertion + Deletion = Indel, or Gap

Sequence alignment 1. Pairwise alignment 2. Multiple alignment 113

- Two DNA sequences: A and B.
- Lengths are m and n, respectively.
- The number of matched pairs is x.
- The number of mismatched pairs is y.
- Total number of bases in gaps is z.
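The quantities x, y and z can be read directly off a finished alignment. The sketch below counts them for two aligned strings that use `-` for a null base; the function name is an assumption, and the example is the alignment shown earlier for the three types of pairs.

```python
def count_pairs(aligned_a, aligned_b, gap="-"):
    """Return (x, y, z): matched pairs, mismatched pairs, and bases opposite a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    x = y = z = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot consist of two null bases")
        if a == gap or b == gap:
            z += 1
        elif a == b:
            x += 1
        else:
            y += 1
    return x, y, z


x, y, z = count_pairs("GCGGCCCATCAGGTAGTTGGTG-G",
                      "GCGTTCCATC--CTGGTTGGTGTG")
print(x, y, z)
```

For any alignment, the sequence lengths satisfy m + n = 2(x + y) + z, which is a convenient sanity check on the counts.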

There are terminal and internal gaps. A terminal gap may indicate missing data. An internal gap indicates that a deletion or an insertion has occurred in one of the two lineages.

GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG

The alignment is the first step in many evolutionary and functional studies. Errors in alignment tend to amplify in later computational stages. 118

Methods of alignment: 1. Manual 2. Dot matrix 3. Distance Matrix 4. Combined (Distance + Manual) 119

Manual alignment. When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection.

GCG-TCCATCAGGTAGTTGGTGTG
GCGTTCCATCAGGTGGTTGGTGTG
*** **********.*********

Advantages of manual alignment: (1) use of a powerful and trainable tool (the brain, well, some brains). (2) ability to integrate additional data, e.g., domain structure, biological function. 121


Protein Alignment may be guided by Tertiary Structures Escherichia coli DjlA protein Homo sapiens DjlA protein 124

Disadvantages of manual alignment: (1) The method is subjective and unscalable. 125

The dot-matrix method: The two sequences are written out as column and row headings of a two-dimensional matrix. A dot is put in the dot-matrix plot at a position where the nucleotides in the two sequences are identical.

The alignment is defined by a path from the upper-left element to the lower-right element.

There are 4 possible steps in the path: (1) a diagonal step through a dot = match. (2) a diagonal step through an empty element of the matrix = mismatch. (3) a horizontal step = a gap in the sequence on the top of the matrix. (4) a vertical step = a gap in the sequence on the left of the matrix. 128

forbidden directions allowed directions 129

A dot matrix may become cluttered. With DNA sequences, ~25% of the elements will be occupied by dots by chance alone. 130

Window size = 1; stringency = 1; alphabet size = 4. The number of spurious matches is determined by window size, stringency, and alphabet size.

Window size = 1; stringency = 1; alphabet size = 4, compared with window size = 3; stringency = 2; alphabet size = 4.

window size = 1 stringency = 1 alphabet size = 20 133
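A minimal sketch of a dot matrix with a sliding window: a dot is placed when at least `stringency` of the `window` residues starting at that position are identical. Anchoring the dot at the start of the window (rather than its center) and the function names are assumptions made for illustration.

```python
def dot_matrix(seq_top, seq_left, window=1, stringency=1):
    """Return a list of rows of booleans; True marks a dot."""
    rows = len(seq_left) - window + 1
    cols = len(seq_top) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Count identical residues inside the window starting at (i, j).
            matches = sum(seq_left[i + k] == seq_top[j + k] for k in range(window))
            row.append(matches >= stringency)
        matrix.append(row)
    return matrix


def render(matrix):
    """Draw the matrix with '*' for a dot and '.' for an empty element."""
    return "\n".join("".join("*" if dot else "." for dot in row) for row in matrix)


top = "GCGTTCCATCAGG"
left = "GCGTCCATCAGG"
print(render(dot_matrix(top, left)))
print()
print(render(dot_matrix(top, left, window=3, stringency=2)))
```

With window size 1 the plot shows many scattered dots off the main diagonals; raising the window size and stringency removes most of them, at the cost of possibly hiding short genuine matches.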

Dot-matrix methods: Advantages: May unravel information on the evolution of sequences. 134

Window size = 60 amino acids; Stringency = 24 matches Advantages: Highlighting Information The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene. 135

Window size = 60 amino acids; Stringency = 24 matches Advantages: Highlighting Information The two diagonally oriented parallel lines most probably indicate that a small internal duplication has occurred in the bacterial gene. 136

Dot-matrix methods: Disadvantage: May not identify the best alignment. 137

Distance and similarity methods 138

The best possible alignment (optimal alignment) is the one in which the numbers of mismatches and gaps are minimized according to certain criteria.

Unfortunately, reducing the number of mismatches results in an increase in the number of gaps, and vice versa. 140

α = matches
β = mismatches
γ = nucleotides in gaps
δ = gaps

Gap penalty (or cost) is a factor (or a set of factors) by which the gap values (numbers and lengths of gaps) are multiplied to make the gaps equivalent in value to the mismatches. The gap penalties are based on our assessment of how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions.

Mismatch penalty is an assessment of how frequently substitutions occur. 143

The distance (dissimilarity) index (D) between two sequences in an alignment is

$$D = \sum_i m_i y_i + \sum_k w_k z_k$$

where $y_i$ is the number of mismatches of type i, $m_i$ is the mismatch penalty for an i-type of mismatch, $z_k$ is the number of gaps of length k, and $w_k$ is a positive number representing the penalty for gaps of length k.

The similarity index (S) between two sequences in an alignment is

$$S = x - \sum_k w_k z_k$$

where x is the number of matches, $z_k$ is the number of gaps of length k, and $w_k$ is a positive number representing the penalty for gaps of length k.

The gap penalty has two components: a gap-opening penalty and a gap-extension penalty. 146

Three main systems: (1) Fixed gap-penalty system = 0 gap-extension costs. (2) Linear gap-penalty system = the gap-extension cost is calculated by multiplying the gap length minus 1 by a constant representing the gap-extension penalty for increasing the gap by 1. (3) Logarithmic gap-penalty system = the gap-extension penalty increases with the logarithm of the gap length, i.e., slower. 147
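The three gap-penalty systems can be sketched as one function returning the penalty w_k for a gap of length k, plus a helper that scores an alignment as the number of matches minus the summed gap penalties. The parameter names, default values, and the exact logarithmic form (opening + extension × ln k) are assumptions made for illustration.

```python
import math


def gap_penalty(k, system="linear", opening=1.0, extension=0.1):
    """Return w_k, the penalty for a single gap of length k (k >= 1)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    if system == "fixed":
        return opening                            # no gap-extension cost
    if system == "linear":
        return opening + extension * (k - 1)      # each extra base costs the same
    if system == "logarithmic":
        return opening + extension * math.log(k)  # extension cost grows more slowly
    raise ValueError(f"unknown gap-penalty system: {system}")


def similarity_index(x, gap_lengths, **penalty_options):
    """Return x minus the summed penalties of the listed gaps."""
    return x - sum(gap_penalty(k, **penalty_options) for k in gap_lengths)


for system in ("fixed", "linear", "logarithmic"):
    print(system, [round(gap_penalty(k, system), 3) for k in (1, 2, 5, 20)])

# 17 matches, one gap of length 2 and one gap of length 1.
print(similarity_index(17, [2, 1], system="linear"))
```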


Further complications: Distinguishing among different matches and mismatches. For example, a mismatched pair consisting of Leu & Ile, which are very similar biochemically to each other, may be given a lesser penalty than a mismatched pair consisting of Arg & Glu, which are very dissimilar from each other.


Alignment algorithms 151

Aim: Find the alignment associated with the smallest D (or largest S) from among all possible alignments. 152

The number of possible alignments may be astronomical. For example, when two sequences 300 residues long each are compared, there are 10^88 possible alignments. In comparison, the number of elementary particles in the universe is only ~10^80.

There are computer algorithms for finding the optimal alignment between two sequences that do not require an exhaustive search of all the possibilities. 154

The Needleman-Wunsch algorithm uses Dynamic Programming

Dynamic programming = a computational technique. It is applicable when large searches can be divided into a succession of small stages, such that (1) the solution of the initial search stage is trivial, (2) each partial solution in a later stage can be calculated by reference to only a small number of solutions in an earlier stage, and (3) the last stage contains the overall solution.
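The three properties listed above map onto a global-alignment routine roughly as follows: the first row and column are the trivial initial stage, each cell depends only on its three neighbors, and the last cell holds the optimal score. This is a simplified sketch in the spirit of Needleman-Wunsch, maximizing a similarity score with a constant cost per null base; the scoring values and function name are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):      # stage 1: trivial boundary solutions
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):      # stage 2: each cell from three earlier cells
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Stage 3: the last cell is the optimum; trace back to recover the path.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("GCGTCCATCAGG", "GCGTTCCATCAGG")
print(best)
print(top)
print(bottom)
```

The table has (m + 1)(n + 1) cells, so the work grows with the product of the sequence lengths instead of with the number of possible alignments. When several alignments share the best score, the traceback returns only one of them.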

Multiple Sequence Alignment

Alignments can be easy or difficult

Easy:
GCGGCCCA TCAGGTAGTT GGTGG
GCGGCCCA TCAGGTAGTT GGTGG
GCGTTCCA TCAGCTGGTT GGTGG
GCGTCCCA TCAGCTAGTT GGTGG
GCGGCGCA TTAGCTAGTT GGTGA
***...** *.**.*.*** ****.

Difficult:
TTGACATG CCGGGG---A AACCG
T-GACATG CCGGTG--GTGT AAGCC
TTGGCATG -CTAGG---A ACGCG
TTGACATG -CTAGGGAAC ACGCG
TTGACATC -CTCTG---A ACGCG


Multiple Alignment

2 methods:
- Dynamic programming (exhaustive, exact). Consider 2 protein sequences of 100 amino acids in length. If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 seconds to align 4 sequences, etc. Aligning 20 sequences exhaustively would take more time than the universe has existed.
- Progressive alignment (heuristic, approximate).

Progressive Alignment

- Devised by Feng and Doolittle in 1987.
- Essentially a heuristic method and, as such, not guaranteed to find the optimal alignment.
- Requires (n-1) + (n-2) + ... + 1 = n(n-1)/2 pairwise alignments as a starting point.
- The most successful implementation is Clustal (Des Higgins).

Overview of Clustal Procedure

1. Quick pairwise alignments
2. Distances for each pair
3. Distance matrix

               1     2     3     4     5
1 Hbb_Human    -
2 Hbb_Horse   .17    -
3 Hba_Human   .59   .60    -
4 Hba_Horse   .59   .59   .13    -
5 Myg_Whale   .77   .77   .75   .75    -

Neighbor-joining tree (guide tree)

Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

Clustal: good points/bad points
Advantages: Speed.
Disadvantages: No way of knowing if the alignment is correct.

Effect of gap penalties on amino-acid alignment: human pancreatic hormone precursor versus chicken pancreatic hormone.
(a) Penalty for gaps is 0.
(b) Penalty for a gap of size k nucleotides is $w_k = 1 + 0.1k$.
(c) The same alignment as in (b), only the similarity between the two sequences is further enhanced by showing pairs of biochemically similar amino acids.

An Alignment

Spinach   GCGGCTCA TCAGGTAGTT GGTG-G
Rice      GCGGCCCA TCAGGTAGTT GGTG-G
Mosquito  GCGTTCCA TC--CT-GTT GGTGTG
Monkey    GCGTCCCA TCAGCTAGTT GTTG-G
Human     GCGGCGCA TTAGCTAGTT GGTG-A
          ***...** *.  .* *** *.** .