Evolutionary Change in Nucleotide Sequences. Lecture 3


1 Evolutionary Change in Nucleotide Sequences Lecture 3 1

2 So far, we described the evolutionary process as a series of gene substitutions in which new alleles, each arising as a mutation in a single individual, progressively increase their frequency and ultimately become fixed in the population.

3 We may look at the process from a different point of view. An allele that becomes fixed is different in its sequence from the allele that it replaces. That is, the substitution of a new allele for an old one is the substitution of a new sequence for a previous sequence. 3

4 If we use a time scale in which one time unit is larger than the time of fixation, then the DNA sequence at any given locus will appear to change with time:
actgggggtaaactatcggtatagatcataa
actgggggttaactatcggtatagatcataa
actgggggttaactatcggtatagatcataa
actgggggttaactatcggtatagatcataa
actgggggtgaactatcggtatagatcataa
actgggggtgaactatcggtacagatcataa

5 To study the dynamics of nucleotide substitution, we must make several assumptions regarding the probability of substitution of a nucleotide by another. 5

6 Jukes & Cantor's one-parameter model

7 Assumption: Substitutions occur with equal probabilities among the four nucleotide types. 7

8 If the nucleotide residing at a certain site in a DNA sequence is A at time 0, what is the probability, $P_{A(t)}$, that this site will be occupied by A at time t?

9 Since we start with A, $P_{A(0)} = 1$. At time 1, the probability of still having A at this site is $P_{A(1)} = 1 - 3\alpha$, where $3\alpha$ is the probability of A changing to T, C, or G, and $1 - 3\alpha$ is the probability that A has remained unchanged.

10 To derive the probability of having A at time 2, we consider two possible scenarios: 10

11 1. The nucleotide has remained unchanged from time 0 to time 2. 11

12 2. The nucleotide has changed to T, C, or G at time 1, but has subsequently reverted to A at time 2. 12

13 $P_{A(2)} = (1 - 3\alpha)P_{A(1)} + \alpha\left(1 - P_{A(1)}\right)$

14 The following equation applies to any t and t + 1: $P_{A(t+1)} = (1 - 3\alpha)P_{A(t)} + \alpha\left(1 - P_{A(t)}\right)$

15 We can rewrite the equation in terms of the amount of change in $P_{A(t)}$ per unit time as: $\Delta P_{A(t)} = P_{A(t+1)} - P_{A(t)} = -3\alpha P_{A(t)} + \alpha\left(1 - P_{A(t)}\right) = -4\alpha P_{A(t)} + \alpha$

16 We approximate the discrete-time process by a continuous-time model, by regarding $\Delta P_{A(t)}$ as the rate of change at time t: $\frac{dP_{A(t)}}{dt} = -4\alpha P_{A(t)} + \alpha$

17 The solution is: $P_{A(t)} = \frac{1}{4} + \left(P_{A(0)} - \frac{1}{4}\right)e^{-4\alpha t}$

18 If we start with A, the probability that the site has A at time 0 is 1. Thus, $P_{A(0)} = 1$, and consequently $P_{A(t)} = \frac{1}{4} + \frac{3}{4}e^{-4\alpha t}$

19 If we start with non-A, the probability that the site has A at time 0 is 0. Thus, $P_{A(0)} = 0$, and consequently $P_{A(t)} = \frac{1}{4} - \frac{1}{4}e^{-4\alpha t}$

20 In the Jukes and Cantor model, the probability of each of the four nucleotides at equilibrium ($t = \infty$) is 1/4. This holds for both starting conditions: for $P_{A(0)} = 1$, $P_{A(t)} = \frac{1}{4} + \frac{3}{4}e^{-4\alpha t}$; for $P_{A(0)} = 0$, $P_{A(t)} = \frac{1}{4} - \frac{1}{4}e^{-4\alpha t}$.
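
The discrete recurrence and the continuous-time solution can be compared numerically. The following is a minimal sketch, not part of the lecture: the function names and the value of α are illustrative assumptions.

```python
import math

def p_same_discrete(alpha, steps, p0=1.0):
    """Iterate P_{A(t+1)} = (1 - 3a) P_{A(t)} + a (1 - P_{A(t)}) for `steps` time units."""
    p = p0
    for _ in range(steps):
        p = (1 - 3 * alpha) * p + alpha * (1 - p)
    return p

def p_same_continuous(alpha, t, p0=1.0):
    """Closed form: P_{A(t)} = 1/4 + (P_{A(0)} - 1/4) e^{-4 a t}."""
    return 0.25 + (p0 - 0.25) * math.exp(-4 * alpha * t)

alpha = 1e-3  # hypothetical substitution probability per time unit
for t in (0, 100, 1000, 10000):
    print(t, round(p_same_discrete(alpha, t), 4), round(p_same_continuous(alpha, t), 4))
```

For small α the two columns should agree closely, and both approach 1/4 as t grows, whichever starting value of $P_{A(0)}$ is used.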

21 So far, we treated $P_{A(t)}$ as a probability. However, $P_{A(t)}$ can also be interpreted as the frequency of A in a DNA sequence at time t. For example, if we start with a sequence made of adenines only, then $P_{A(0)} = 1$, and $P_{A(t)}$ is the expected frequency of A in the sequence at time t. The expected frequency of A in the sequence at equilibrium will be 1/4, and so will the expected frequencies of T, C, and G.

22 After reaching equilibrium no further change in the nucleotide frequencies is expected to occur. However, the actual frequencies of the nucleotides will remain unchanged only in DNA sequences of infinite length. In practice, fluctuations in nucleotide frequencies are likely to occur. 22

23 23

24 24

25 25

26 NUMBER OF NUCLEOTIDE SUBSTITUTIONS BETWEEN TWO DNA SEQUENCES 26

27 After two nucleotide sequences diverge from each other, each of them will start accumulating nucleotide substitutions. If two sequences of length N differ from each other at n sites, then the proportion of differences, n/N, is referred to as the degree of divergence or Hamming distance. Degrees of divergence are usually expressed as percentages (n/N × 100%).

28 28

29 The observed number of differences is likely to be smaller than the actual number of substitutions due to multiple hits at the same site.

30 13 mutations = 3 differences 30

31 31

32 Number of substitutions between two noncoding (NOT protein coding) sequences 32

33 The one-parameter model. In this model, it is sufficient to consider only $I_{(t)}$, which is the probability that the nucleotide at a given site at time t is the same in both sequences.

34 $I_{(t)} = \frac{1}{4} + \frac{3}{4}e^{-8\alpha t}$, where $I_{(t)}$ is the proportion of identical nucleotides between two sequences that diverged t time units ago. Compare with $P_{A(t)} = \frac{1}{4} + \frac{3}{4}e^{-4\alpha t}$ for a single sequence.

35 The probability that the two sequences are different at a site at time t is $p = 1 - I_{(t)}$, that is, $p = \frac{3}{4}\left(1 - e^{-8\alpha t}\right)$. t is usually not known and, thus, we cannot estimate α. Instead, we compute K, which is the number of substitutions per site since the time of divergence between the two sequences.

36 36

37 $p = \frac{3}{4}\left(1 - e^{-8\alpha t}\right)$; L = number of sites compared between the two sequences.
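
The slide carrying the estimator for K did not survive transcription. Under the one-parameter model each lineage accumulates 3αt substitutions per site, so K = 6αt, and solving the expression for p gives $K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, the same correction applied to $K_S$ and $K_A$ later in this lecture. The sketch below assumes that formula, together with the commonly cited large-sample variance $V(K) = p(1-p) / [L(1 - \frac{4}{3}p)^2]$; the function name is hypothetical, and the two test sequences are the first and last sequences from slide 4.

```python
import math

def jukes_cantor_distance(seq1, seq2):
    """Return (p, K, V(K)) for two aligned, gap-free sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    L = len(seq1)                                   # number of sites compared
    n = sum(a != b for a, b in zip(seq1, seq2))     # observed differences
    p = n / L                                       # degree of divergence
    if p >= 0.75:
        raise ValueError("p >= 3/4: the correction is undefined (saturation)")
    K = -0.75 * math.log(1 - 4 * p / 3)             # substitutions per site
    var_K = p * (1 - p) / (L * (1 - 4 * p / 3) ** 2)
    return p, K, var_K

s1 = "actgggggtaaactatcggtatagatcataa"
s2 = "actgggggtgaactatcggtacagatcataa"
p, K, var_K = jukes_cantor_distance(s1, s2)
print(f"p = {p:.4f}, K = {K:.4f}, V(K) = {var_K:.6f}")
```

K is always at least as large as p, and the gap between them widens as the sequences diverge, reflecting the growing number of multiple hits.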

38 Jukes & Cantor's one-parameter model

39 39

40 Kimura's two-parameter model

41 Assumptions: The rate of transitional substitution at each nucleotide site is α per unit time. The rate of each type of transversional substitution is β per unit time. 41

42 α β

43 If the nucleotide residing at a certain site in a DNA sequence is A at time 0, what is the probability, $P_{AA(t)}$, that this site will be occupied by A at time t?

44 After one time unit the probability of A changing into G is α, the probability of A changing into C is β, and the probability of A changing into T is β. Thus, the probability of A remaining unchanged after one time unit is: $P_{AA(1)} = 1 - \alpha - 2\beta$

45 To derive the probability of having A at time 2, we consider four possible scenarios: 45

46 1. A remained unchanged at t = 1 and t = 2 46

47 2. A changed into G at t = 1 and reverted by a transition to A at t = 2 47

48 3. A changed into C at t = 1 and reverted by a transversion to A at t = 2 48

49 4. A changed into T at t = 1 and reverted by a transversion to A at t = 2 49

50 $P_{AA(2)} = (1 - \alpha - 2\beta)P_{AA(1)} + \beta P_{TA(1)} + \beta P_{CA(1)} + \alpha P_{GA(1)}$

51 By extension we obtain the following recurrence equation for the general case: $P_{AA(t+1)} = (1 - \alpha - 2\beta)P_{AA(t)} + \beta P_{TA(t)} + \beta P_{CA(t)} + \alpha P_{GA(t)}$

52 After rewriting this equation as the amount of change in $P_{AA(t)}$ per unit time, and after approximating the discrete-time model by the continuous-time model, we obtain the following differential equation: $\frac{dP_{AA(t)}}{dt} = -(\alpha + 2\beta)P_{AA(t)} + \beta P_{TA(t)} + \beta P_{CA(t)} + \alpha P_{GA(t)}$

53 Similarly, we can obtain equations for $P_{TA(t)}$, $P_{CA(t)}$, and $P_{GA(t)}$, and from this set of four equations, we arrive at the following solution: $P_{AA(t)} = \frac{1}{4} + \frac{1}{4}e^{-4\beta t} + \frac{1}{2}e^{-2(\alpha + \beta)t}$. For comparison, the Jukes-Cantor solution is $P_{A(t)} = \frac{1}{4} + \frac{3}{4}e^{-4\alpha t}$.

54 In the Jukes-Cantor model, $P_{AA(t)} = P_{GG(t)} = P_{CC(t)} = P_{TT(t)}$. Because of the symmetry of the substitution scheme, this equality also holds for Kimura's two-parameter model.

55 Three probabilities. $X_{(t)}$ = the probability that a nucleotide at a site at time t is identical to that at time 0: $X_{(t)} = \frac{1}{4} + \frac{1}{4}e^{-4\beta t} + \frac{1}{2}e^{-2(\alpha + \beta)t}$. At equilibrium, the equation reduces to $X_{(\infty)} = 1/4$. Thus, as in the case of Jukes and Cantor's model, the equilibrium frequencies of the four nucleotides are 1/4.

56 Three probabilities. $Y_{(t)}$ = the probability that the initial nucleotide and the nucleotide at time t differ from each other by a transition. Because of the symmetry of the substitution scheme, $Y_{(t)} = P_{AG(t)} = P_{GA(t)} = P_{TC(t)} = P_{CT(t)}$: $Y_{(t)} = \frac{1}{4} + \frac{1}{4}e^{-4\beta t} - \frac{1}{2}e^{-2(\alpha + \beta)t}$

57 Three probabilities. $Z_{(t)}$ = the probability that the nucleotide at time t and the initial nucleotide differ by a specific type of transversion: $Z_{(t)} = \frac{1}{4} - \frac{1}{4}e^{-4\beta t}$

58 Each nucleotide is subject to two types of transversion, but only one type of transition. Therefore, the probability that the initial nucleotide and the nucleotide at time t differ by a transversion is twice the probability that they differ by a specific type of transversion, i.e., $2Z_{(t)}$, and $X_{(t)} + Y_{(t)} + 2Z_{(t)} = 1$
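
A short numerical check of the three probabilities follows. This is a sketch with hypothetical rate values, not material from the lecture; it confirms that X + Y + 2Z = 1 at every t and that the expressions collapse to the Jukes-Cantor solution when α = β.

```python
import math

def k2p_probabilities(alpha, beta, t):
    """Return (X, Y, Z) under Kimura's two-parameter model.

    X: same nucleotide as at time 0
    Y: differs from the initial nucleotide by a transition
    Z: differs by one specific type of transversion
    """
    e1 = math.exp(-4 * beta * t)
    e2 = math.exp(-2 * (alpha + beta) * t)
    X = 0.25 + 0.25 * e1 + 0.5 * e2
    Y = 0.25 + 0.25 * e1 - 0.5 * e2
    Z = 0.25 - 0.25 * e1
    return X, Y, Z

alpha, beta = 2e-3, 5e-4  # hypothetical rates; transitions faster than transversions
for t in (0, 100, 1000, 100000):
    X, Y, Z = k2p_probabilities(alpha, beta, t)
    print(t, round(X, 4), round(Y, 4), round(Z, 4), round(X + Y + 2 * Z, 4))
```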

59 Number of substitutions between two noncoding (NOT protein coding) sequences 59

60 The differences between two sequences are classified into transitions and transversions. P = proportion of transitional differences Q = proportion of transversional differences 60

61 61

62 62

63 63

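64 $V(K) = \frac{1}{L}\left[ P\left(\frac{1}{1 - 2P - Q}\right)^2 + Q\left(\frac{1}{2 - 4P - 2Q} + \frac{1}{2 - 4Q}\right)^2 - \left(\frac{P}{1 - 2P - Q} + \frac{Q}{2 - 4P - 2Q} + \frac{Q}{2 - 4Q}\right)^2 \right]$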
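
The slides that carried the two-parameter estimator of K are blank in this transcript. The sketch below assumes the standard Kimura (1980) estimator, $K = \frac{1}{2}\ln\frac{1}{1 - 2P - Q} + \frac{1}{4}\ln\frac{1}{1 - 2Q}$, and pairs it with the variance on slide 64; the function name is hypothetical and the input is assumed to be two aligned, gap-free DNA sequences.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def k2p_distance(seq1, seq2):
    """Return (P, Q, K, V(K)) for two aligned, gap-free DNA sequences."""
    seq1, seq2 = seq1.upper(), seq2.upper()
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    L = len(seq1)
    ts = sum((a, b) in TRANSITIONS for a, b in zip(seq1, seq2))
    tv = sum(a != b for a, b in zip(seq1, seq2)) - ts
    P, Q = ts / L, tv / L        # transitional / transversional proportions
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("distance undefined for these proportions")
    a = 1 / (1 - 2 * P - Q)
    b = 1 / (1 - 2 * Q)
    K = 0.5 * math.log(a) + 0.25 * math.log(b)
    c = a                        # coefficient of P in the variance
    d = 0.5 * (a + b)            # equals 1/(2-4P-2Q) + 1/(2-4Q)
    var_K = (c * c * P + d * d * Q - (c * P + d * Q) ** 2) / L
    return P, Q, K, var_K

print(k2p_distance("actgggggtaaactatcggtatagatcataa",
                   "actgggggtgaactatcggtacagatcataa"))
```

Both differences between these two sequences (a/g and t/c) are transitions, so Q = 0 and only the first logarithm contributes to K.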

65 65

66 Numerical example (2P-model) 66

67 There are substitution schemes with more than two parameters!

68 Number of substitutions between two protein-coding genes 68

69 Computing the number of substitutions between two protein-coding sequences is more complicated, because a distinction should be made between synonymous and nonsynonymous substitutions.

70 Two ratios are computed:
(Number of synonymous substitutions) / (Number of synonymous sites)
(Number of nonsynonymous substitutions) / (Number of nonsynonymous sites)

71 71

72 Aims:
1. Compute two numerators: the numbers of synonymous and nonsynonymous substitutions.
2. Compute two denominators: the numbers of synonymous and nonsynonymous sites.

73 Difficulties with denominator: 1. The classification of a site changes with time. For example, the third position of CGG (Arg) is synonymous. However, if the first position changes to T, then the third position of the resulting codon, TGG (Trp), becomes nonsynonymous.

74 Difficulties with denominator: 2. Many sites are neither completely synonymous nor completely nonsynonymous. For example, a transition in the third position of GAT (Asp) will be synonymous, while a transversion to GAG or GAA will alter the amino acid. 74

75 Difficulties with numerator: 1. The classification of the change depends on the order in which the substitutions had occurred. 75

76 Difficulties with numerator: 1. When two homologous codons differ from each other by two substitutions or more, the order of the substitutions must be known in order to classify substitutions into synonymous and nonsynonymous. Example: CCC in sequence 1 and CAA in sequence 2. Pathway I: CCC (Pro) → CCA (Pro) → CAA (Gln), i.e., 1 synonymous and 1 nonsynonymous.

77 Difficulties with numerator: 2. Transitions occur with different frequencies than transversions. 3. The type of substitution depends on the mutation. Transitions result more frequently in synonymous substitutions than transversions.

78 Miyata & Yasunaga (1980) and Nei & Gojobori (1986) method 78

79 1. Classification of sites. Consider a particular position in a codon. Let i be the number of possible synonymous changes at this site. Then this site is counted as i/3 synonymous and (3 − i)/3 nonsynonymous.

80 In TTT (Phe), the first two positions are nonsynonymous, because no synonymous change can occur in them, and the third position is 1/3 synonymous and 2/3 nonsynonymous because one of the three possible changes is synonymous. 80
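
The site classification can be sketched in a few lines. The code below is an illustration, not the authors' implementation: it builds the standard genetic code from a compact string and, as a simplifying assumption, counts any change to or from a stop codon as nonsynonymous.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_fractions(codon):
    """Return i/3 for each of the three codon positions,
    where i is the number of synonymous single-base changes at that position."""
    fractions = []
    for pos in range(3):
        i = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                i += 1
        fractions.append(i / 3)
    return fractions

for codon in ("TTT", "CGG", "TGG", "GAT"):
    syn = synonymous_fractions(codon)
    print(codon, CODON_TABLE[codon], "synonymous sites per position:", syn)
```

Summing the three fractions gives the number of synonymous sites contributed by the codon; 3 minus that sum is its number of nonsynonymous sites.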

81 2. Count the number of synonymous and nonsynonymous sites in each sequence and compute the averages between the two sequences. The average number of synonymous sites is $N_S$ and that of nonsynonymous sites is $N_A$.

82 3. Classify nucleotide differences into synonymous and nonsynonymous differences.

83 For two codons that differ by only one nucleotide, the difference is easily inferred. For example, the difference between the two codons GTC (Val) and GTT (Val) is synonymous, while the difference between the two codons GTC (Val) and GCC (Ala) is nonsynonymous. 83

84 84

85 For two codons that differ by two or more nucleotides, the estimation problem is more complicated, because we need to determine the order in which the substitutions occurred. 85

86 Pathway (1) requires one synonymous and one nonsynonymous change, whereas pathway (2) requires two nonsynonymous changes.

87 There are two approaches to deal with multiple substitutions at a codon:

88 The unweighted method: Average the numbers of the different types of substitutions for all the possible scenarios. For example, if we assume that the two pathways are equally likely, then the number of nonsynonymous differences is (1 + 2)/2 = 1.5, and the number of synonymous differences is (1 + 0)/2 = 0.5.

89 The weighted method: Employ a priori criteria to assign the probability of each pathway. For instance, if the weight of pathway 1 is 0.9, and the weight for pathway 2 is 0.1, then the number of nonsynonymous differences between the two codons is (0.9 × 1) + (0.1 × 2) = 1.1, and the number of synonymous differences is (0.9 × 1) + (0.1 × 0) = 0.9.
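
Both methods can be reproduced for the CCC versus CAA example. The sketch below enumerates every order in which the differing positions could have changed; it is an illustration with hypothetical function names, and it does not treat pathways that pass through stop codons specially.

```python
from itertools import permutations, product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def pathway_counts(codon1, codon2):
    """Return a list of (synonymous, nonsynonymous) step counts,
    one entry per possible order of the single-base changes."""
    differing = [i for i in range(3) if codon1[i] != codon2[i]]
    results = []
    for order in permutations(differing):
        current, syn, nonsyn = codon1, 0, 0
        for pos in order:
            nxt = current[:pos] + codon2[pos] + current[pos + 1:]
            if CODON_TABLE[nxt] == CODON_TABLE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = nxt
        results.append((syn, nonsyn))
    return results

def expected_differences(paths, weights=None):
    """Weighted average of (synonymous, nonsynonymous) counts; equal weights by default."""
    if weights is None:
        weights = [1 / len(paths)] * len(paths)
    syn = sum(w * s for w, (s, _) in zip(weights, paths))
    nonsyn = sum(w * n for w, (_, n) in zip(weights, paths))
    return syn, nonsyn

# Sort so that the pathway with the synonymous step (pathway 1 in the slides) comes first.
paths = sorted(pathway_counts("CCC", "CAA"), reverse=True)
print("pathways:", paths)
print("unweighted:", expected_differences(paths))
print("weighted:  ", expected_differences(paths, weights=[0.9, 0.1]))
```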

90 90

91 4. The numbers of synonymous and nonsynonymous differences between the two protein-coding sequences are $M_S$ and $M_A$, respectively.

92 The number of synonymous differences per synonymous site is $p_S = M_S / N_S$. The number of nonsynonymous differences per nonsynonymous site is $p_A = M_A / N_A$.

93 If we take into account the effect of multiple hits at the same site, we can make corrections by using Jukes and Cantor's formula: 93

94 94

95 $K_S = -\frac{3}{4}\ln\left(1 - \frac{4M_S}{3N_S}\right)$

96 $K_A = -\frac{3}{4}\ln\left(1 - \frac{4M_A}{3N_A}\right)$

97 97

98 Number of Amino-Acid Replacements between Two Proteins. The observed proportion of different amino acids between the two sequences (p) is p = n/L, where n = number of amino acid differences between the two sequences and L = length of the aligned sequences.

99 99

100 Number of Amino-Acid Replacements between Two Proteins. The Poisson model is used to convert p into the number of amino-acid replacements between two sequences (d): $d = -\ln(1 - p)$. The variance of d is estimated as $V(d) = \frac{p}{L(1 - p)}$
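
As a minimal sketch of the Poisson correction (the function name and the example counts are hypothetical):

```python
import math

def poisson_distance(n, L):
    """Return (p, d, V(d)) from n amino-acid differences over L aligned sites."""
    if not 0 <= n < L:
        raise ValueError("need 0 <= n < L")
    p = n / L                      # observed proportion of differences
    d = -math.log(1 - p)           # estimated replacements per site
    var_d = p / (L * (1 - p))      # variance of d
    return p, d, var_d

p, d, var_d = poisson_distance(n=10, L=100)
print(f"p = {p:.3f}, d = {d:.4f}, V(d) = {var_d:.6f}")
```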

101 ALIGNMENT OF NUCLEOTIDE & AMINO-ACID SEQUENCES

102 102

103 Homology: The term was coined by Richard Owen. Definition: Similarity resulting from common ancestry.

104 Homology: A qualitative statement. Homology designates a relationship of common descent between entities. Two genes are either homologs or not. It doesn't make sense to say two genes are 43% homologous, just as it doesn't make sense to say Linda is 24% pregnant.

105 Homology By comparing homologous characters, we can reconstruct the evolutionary events that have led to the formation of the extant sequences from the common ancestor. 105

106 Homology: When dealing with sequences, we are interested in POSITIONAL HOMOLOGY. We identify positional homology by ALIGNMENT.

107 Known history: the ancestral sequence ACTGGGCCCAAATC gave rise to CTGGGCCCAGATC (1 deletion, 1 substitution) and to AACAGGGCCCAAATC (1 insertion, 1 substitution).
Correct alignment:
--CTGGGCCCAGATC
AACAGGGCCCAAATC
Incorrect alignment:
CTGGGCCCAGATC--
AACAGGGCCCAAATC

108 In practice the ancestor is unknown and the processes are unknown; only CTGGGCCCAGATC and AACAGGGCCCAAATC are observed.
Correct alignment?
--CTGGGCCCAGATC
AACAGGGCCCAAATC
Incorrect alignment?
CTGGGCCCAGATC--
AACAGGGCCCAAATC

109 ACCTGAATTTGCCC T9 -A6 G5T -A7 +ACA12 T8A +G2 ACCTTAATTGCACACC ACCTTAATTGCACACC AGCCTGATTGCCC--- AGCCTGATTGCCC C2G, T4C, A6G, A12C, -ACC14 109

110 Alignment: A hypothesis concerning positional homology among residues in a sequence. Positional homology = A pair of nucleotides from two aligned sequences that have descended from one nucleotide in the ancestor of the two sequences.

111 An alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:
(1) matches = the same nucleotide appears in both sequences.
(2) mismatches = different nucleotides are found in the two sequences.
(3) gaps = a base in one sequence and a null base in the other.
GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
***..*****  .*.******* *
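
The three pair types can be counted directly from an aligned pair of sequences. The sketch below uses the alignment from slide 111; the function name is hypothetical.

```python
def classify_pairs(seq_a, seq_b, gap="-"):
    """Count (matches, mismatches, gap pairs) column by column in an alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            gaps += 1            # a base paired with a null base
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps

x, y, z = classify_pairs("GCGGCCCATCAGGTAGTTGGTG-G",
                         "GCGTTCCATC--CTGGTTGGTGTG")
print("matches:", x, "mismatches:", y, "bases in gaps:", z)
```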

112 Sequence alignment = The identification of the location of deletions or insertions that might have occurred in either of the two lineages since their divergence from a common ancestor. Insertion + Deletion = Indel or Gap.

113 Sequence alignment 1. Pairwise alignment 2. Multiple alignment 113

114
- Two DNA sequences: A and B.
- Lengths are m and n, respectively.
- The number of matched pairs is x.
- The number of mismatched pairs is y.
- Total number of bases in gaps is z.

115 There are terminal and internal gaps. GCGG-CCATCAGGTAGTTGGTG-- GCGTTCCATC--CTGGTTGGTGTG 115

116 A terminal gap may indicate missing data. GCGG-CCATCAGGTAGTTGGTG-- GCGTTCCATC--CTGGTTGGTGTG 116

117 An internal gap indicates that a deletion or an insertion has occurred in one of the two lineages. GCGG-CCATCAGGTAGTTGGTG-- GCGTTCCATC--CTGGTTGGTGTG 117

118 The alignment is the first step in many evolutionary and functional studies. Errors in alignment tend to amplify in later computational stages. 118

119 Methods of alignment: 1. Manual 2. Dot matrix 3. Distance Matrix 4. Combined (Distance + Manual) 119

120 Manual alignment. When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection. GCG-TCCATCAGGTAGTTGGTGTG GCGTTCCATCAGGTGGTTGGTGTG *** **********.********* 120

121 Advantages of manual alignment: (1) use of a powerful and trainable tool (the brain, well, some brains). (2) ability to integrate additional data, e.g., domain structure, biological function. 121

122 122

123 123

124 Protein Alignment may be guided by Tertiary Structures Escherichia coli DjlA protein Homo sapiens DjlA protein 124

125 Disadvantages of manual alignment: (1) The method is subjective and unscalable. 125

126 The dot-matrix method: The two sequences are written out as column and row headings of a two-dimensional matrix. A dot is put in the dot-matrix plot at a position where the nucleotides in the two sequences are identical.

127 The alignment is defined by a path from the upper-left element to the lower-right element.

128 There are 4 possible steps in the path: (1) a diagonal step through a dot = match. (2) a diagonal step through an empty element of the matrix = mismatch. (3) a horizontal step = a gap in the sequence on the top of the matrix. (4) a vertical step = a gap in the sequence on the left of the matrix. 128

129 forbidden directions allowed directions 129

130 A dot matrix may become cluttered. With DNA sequences, ~25% of the elements will be occupied by dots by chance alone. 130

131 The number of spurious matches is determined by: window size, stringency, & alphabet size. Example plot: window size = 1, stringency = 1, alphabet size = 4.

132 Example plots: window size = 1, stringency = 1, alphabet size = 4, compared with window size = 3, stringency = 2, alphabet size = 4.
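
The effect of window size and stringency can be sketched as follows. This is an illustrative implementation with hypothetical names, in which a dot is placed at the start of a window whenever at least `stringency` of the `window` paired positions are identical.

```python
def dot_matrix(top, left, window=1, stringency=1):
    """Return a list of rows of booleans; rows follow `left`, columns follow `top`.

    Cell [i][j] is True when the window starting at left[i] and top[j]
    contains at least `stringency` identical positions.
    """
    rows = []
    for i in range(len(left) - window + 1):
        row = []
        for j in range(len(top) - window + 1):
            hits = sum(left[i + k] == top[j + k] for k in range(window))
            row.append(hits >= stringency)
        rows.append(row)
    return rows

def render(matrix):
    """Draw the matrix with '*' for a dot and '.' for an empty element."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)

top = "GCGTCCATCAGGTAGTTGGTGTG"
left = "GCGTTCCATCAGGTGGTTGGTGTG"
print(render(dot_matrix(top, left)))                          # cluttered
print()
print(render(dot_matrix(top, left, window=3, stringency=3)))  # cleaner diagonal
```

With window size 1 and a four-letter alphabet, roughly a quarter of the cells are filled by chance; a larger window with a high stringency removes most of those spurious dots while keeping the main diagonal.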

133 window size = 1 stringency = 1 alphabet size =

134 Dot-matrix methods: Advantages: May unravel information on the evolution of sequences. 134

135 Window size = 60 amino acids; Stringency = 24 matches Advantages: Highlighting Information The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene. 135

136 Window size = 60 amino acids; Stringency = 24 matches Advantages: Highlighting Information The two diagonally oriented parallel lines most probably indicate that a small internal duplication has occurred in the bacterial gene. 136

137 Dot-matrix methods: Disadvantage: May not identify the best alignment. 137

138 Distance and similarity methods 138

139 The best possible alignment (optimal alignment) is the one in which the numbers of mismatches and gaps are minimized according to certain criteria.

140 Unfortunately, reducing the number of mismatches results in an increase in the number of gaps, and vice versa. 140

141 α = matches, β = mismatches, γ = nucleotides in gaps, δ = gaps

142 Gap penalty (or cost) is a factor (or a set of factors) by which the gap values (numbers and lengths of gaps) are multiplied to make the gaps equivalent in value to the mismatches. The gap penalties are based on our assessment of how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions.

143 Mismatch penalty is an assessment of how frequently substitutions occur. 143

144 The distance (dissimilarity) index (D) between two sequences in an alignment is $D = \sum_i m_i y_i + \sum_k w_k z_k$, where $y_i$ is the number of mismatches of type i, $m_i$ is the mismatch penalty for an i-type of mismatch, $z_k$ is the number of gaps of length k, and $w_k$ is a positive number representing the penalty for gaps of length k.

145 The similarity index (S) between two sequences in an alignment is $S = x - \sum_k w_k z_k$, where x is the number of matches, $z_k$ is the number of gaps of length k, and $w_k$ is a positive number representing the penalty for gaps of length k.
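
Both indices can be computed from an alignment once penalties are chosen. The sketch below makes simplifying assumptions that are not in the lecture: a single mismatch type with penalty m, and a hypothetical default gap penalty $w_k = 2 + 0.5(k - 1)$.

```python
from itertools import groupby

def column_states(seq_a, seq_b, gap="-"):
    """Label each alignment column: M = match, X = mismatch,
    A = gap in seq_a, B = gap in seq_b."""
    states = []
    for a, b in zip(seq_a, seq_b):
        if a == gap:
            states.append("A")
        elif b == gap:
            states.append("B")
        elif a == b:
            states.append("M")
        else:
            states.append("X")
    return states

def distance_and_similarity(seq_a, seq_b, m=1.0,
                            w=lambda k: 2.0 + 0.5 * (k - 1)):
    """Return (D, S) with D = m*y + sum_k w(k) z_k and S = x - sum_k w(k) z_k."""
    states = column_states(seq_a, seq_b)
    x = states.count("M")                      # matches
    y = states.count("X")                      # mismatches
    # Each run of consecutive gap columns in the same sequence is one gap of length k.
    gap_lengths = [len(list(run)) for state, run in groupby(states) if state in "AB"]
    gap_cost = sum(w(k) for k in gap_lengths)
    return m * y + gap_cost, x - gap_cost

D, S = distance_and_similarity("GCGGCCCATCAGGTAGTTGGTG-G",
                               "GCGTTCCATC--CTGGTTGGTGTG")
print("D =", D, "S =", S)
```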

146 The gap penalty has two components: a gap-opening penalty and a gap-extension penalty. 146

147 Three main systems: (1) Fixed gap-penalty system = 0 gap-extension costs. (2) Linear gap-penalty system = the gap-extension cost is calculated by multiplying the gap length minus 1 by a constant representing the gap-extension penalty for increasing the gap by 1. (3) Logarithmic gap-penalty system = the gap-extension penalty increases with the logarithm of the gap length, i.e., slower. 147
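
The three systems can be written as small functions of the gap length k. The parameter values are hypothetical, and the logarithmic form (opening cost plus a constant times ln k) is one plausible reading of the description above rather than a formula given in the lecture.

```python
import math

def fixed_gap_penalty(k, opening):
    """Fixed system: the cost does not depend on gap length (zero extension cost)."""
    return opening

def linear_gap_penalty(k, opening, extension):
    """Linear system: opening cost plus (k - 1) times the extension penalty."""
    return opening + (k - 1) * extension

def logarithmic_gap_penalty(k, opening, extension):
    """Logarithmic system (assumed form): extension cost grows with ln(k)."""
    return opening + extension * math.log(k)

for k in (1, 2, 5, 20):
    print(k,
          fixed_gap_penalty(k, 2.0),
          linear_gap_penalty(k, 2.0, 0.5),
          round(logarithmic_gap_penalty(k, 2.0, 0.5), 3))
```

All three agree for a gap of length 1; for long gaps the logarithmic system charges far less than the linear one, which makes long indels comparatively cheaper.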

148 148

149 Further complications: Distinguishing among different matches and mismatches. For example, a mismatched pair consisting of Leu & Ile, which are very similar biochemically to each other, may be given a lesser penalty than a mismatched pair consisting of Arg & Glu, which are very dissimilar from each other.

150 Lesser penalty than 150

151 Alignment algorithms 151

152 Aim: Find the alignment associated with the smallest D (or largest S) from among all possible alignments. 152

153 The number of possible alignments may be astronomical. For example, when two sequences 300 residues long each are compared, there are possible alignments. In comparison, the number of elementary particles in the universe is only ~

154 There are computer algorithms for finding the optimal alignment between two sequences that do not require an exhaustive search of all the possibilities. 154

155 The Needleman-Wunsch algorithm uses Dynamic Programming

156 Dynamic programming = a computational technique. It is applicable when large searches can be divided into a succession of small stages, such that (1) the solution of the initial search stage is trivial, (2) each partial solution in a later stage can be calculated by reference to only a small number of solutions in an earlier stage, and (3) the last stage contains the overall solution.
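
A compact sketch of global alignment by dynamic programming follows. It maximizes a similarity score with hypothetical values (match +1, mismatch −1, and a linear cost of −2 per gap position), which is a simplification of the gap-penalty systems described earlier; the function name is illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    m, n = len(a), len(b)
    # Stage 0 is trivial: aligning a prefix against nothing costs only gaps.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    # Each cell depends on just three previously solved neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # The last cell holds the optimal score; trace back to recover one alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))

s, al_a, al_b = needleman_wunsch("GCGTCCATCAGGTAGTTGGTGTG",
                                 "GCGTTCCATCAGGTGGTTGGTGTG")
print(s)
print(al_a)
print(al_b)
```

The two input sequences are those of the manual-alignment example on slide 120 with the gap removed. Several equally good placements of the single gap exist, so the traceback may return a different but equally scoring alignment.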

157 Multiple Sequence Alignment

158 Alignments can be easy or difficult.
Easy:
GCGGCCCA TCAGGTAGTT GGTGG
GCGGCCCA TCAGGTAGTT GGTGG
GCGTTCCA TCAGCTGGTT GGTGG
GCGTCCCA TCAGCTAGTT GGTGG
GCGGCGCA TTAGCTAGTT GGTGA
***...** *.**.*.*** ****.
Difficult:
TTGACATG CCGGGG---A AACCG
T-GACATG CCGGTG--GTGT AAGCC
TTGGCATG -CTAGG---A ACGCG
TTGACATG -CTAGGGAAC ACGCG
TTGACATC -CTCTG---A ACGCG
* *.***. *... *. *..*.

159 159

160 Multiple Alignment 2 methods: Dynamic programming (exhaustive, exact) Consider 2 protein sequences of 100 amino acids in length. If it takes seconds to exhaustively align these sequences, then it will take seconds to align 3 sequences, to align 4 sequences...etc. More time than the universe has existed to align 20 sequences exhaustively. Progressive alignment (heuristic, approximate) 160

161 Progressive Alignment. Devised by Feng and Doolittle. Essentially a heuristic method and as such is not guaranteed to find the optimal alignment. Requires (n − 1) + (n − 2) + ... + 1 = n(n − 1)/2 pairwise alignments as a starting point. Most successful implementation is Clustal (Des Higgins).

162 Overview of the Clustal procedure. Input sequences: Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Whale.
- Quick pairwise alignments; distances for each pair are collected in a distance matrix.
- A neighbor-joining tree (guide tree) is built from the distance matrix.
- Progressive alignment following the guide tree:
1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

163 Clustal: good points/bad points. Advantages: Speed. Disadvantages: No way of knowing if the alignment is correct.

164 Effect of gap penalties on amino-acid alignment. Human pancreatic hormone precursor versus chicken pancreatic hormone. (a) Penalty for gaps is 0. (b) Penalty for a gap of size k nucleotides is $w_k$ = k. (c) The same alignment as in (b), only the similarity between the two sequences is further enhanced by showing pairs of biochemically similar amino acids.

165 An Alignment
Spinach  GCGGCTCA TCAGGTAGTT GGTG-G
Rice     GCGGCCCA TCAGGTAGTT GGTG-G
Mosquito GCGTTCCA TC--CT-GTT GGTGTG
Monkey   GCGTCCCA TCAGCTAGTT GTTG-G
Human    GCGGCGCA TTAGCTAGTT GGTG-A


More information

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models) Regulatory Sequence Analysis Sequence models (Bernoulli and Markov models) 1 Why do we need random models? Any pattern discovery relies on an underlying model to estimate the random expectation. This model

More information

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT 5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences 140.638 where do sequences come from? DNA is not hard to extract (getting DNA from a

More information

Supplemental data. Pommerrenig et al. (2011). Plant Cell /tpc

Supplemental data. Pommerrenig et al. (2011). Plant Cell /tpc Supplemental Figure 1. Prediction of phloem-specific MTK1 expression in Arabidopsis shoots and roots. The images and the corresponding numbers showing absolute (A) or relative expression levels (B) of

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

Sequence Alignment (chapter 6)

Sequence Alignment (chapter 6) Sequence lignment (chapter 6) he biological problem lobal alignment Local alignment Multiple alignment Introduction to bioinformatics, utumn 6 Background: comparative genomics Basic question in biology:

More information

Similarity or Identity? When are molecules similar?

Similarity or Identity? When are molecules similar? Similarity or Identity? When are molecules similar? Mapping Identity A -> A T -> T G -> G C -> C or Leu -> Leu Pro -> Pro Arg -> Arg Phe -> Phe etc If we map similarity using identity, how similar are

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013 Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Pairwise sequence alignments

Pairwise sequence alignments Pairwise sequence alignments Volker Flegel VI, October 2003 Page 1 Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs VI, October

More information

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics 582746 Modelling and Analysis in Bioinformatics Lecture 1: Genomic k-mer Statistics Juha Kärkkäinen 06.09.2016 Outline Course introduction Genomic k-mers 1-Mers 2-Mers 3-Mers k-mers for Larger k Outline

More information

The use of molecular tools for taxonomic research in zoology & botany

The use of molecular tools for taxonomic research in zoology & botany The use of molecular tools for taxonomic research in zoology & botany Outline Why employ molecular genetic markers? Brief historical overview of DN research Molecular techniques for genetic analysis DN

More information

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

More information

EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS

EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS August 0 Vol 4 No 005-0 JATIT & LLS All rights reserved ISSN: 99-8645 wwwjatitorg E-ISSN: 87-95 EVOLUTIONAY DISTANCE MODEL BASED ON DIFFEENTIAL EUATION AND MAKOV OCESS XIAOFENG WANG College of Mathematical

More information

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

More information
