Similarity or Identity? When are molecules similar?

Mapping identity

Nucleotides: A -> A, T -> T, G -> G, C -> C
Amino acids: Leu -> Leu, Pro -> Pro, Arg -> Arg, Phe -> Phe, etc.

If we map similarity using identity, how similar are these sequences?

    CTTCCGCGC
    :: :: ::
    CTACCTCGA        66%?

Convert the sequences to polypeptides. Now how similar are the same sequences?

    Leu Pro Arg
     :   :   :
    Leu Pro Arg      100%?
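The two percentages above can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the codon table is deliberately limited to the six codons that appear in the example.

```python
# Codon table limited to the codons used in the example (illustrative subset).
CODONS = {
    "CTT": "Leu", "CTA": "Leu",
    "CCG": "Pro", "CCT": "Pro",
    "CGC": "Arg", "CGA": "Arg",
}


def percent_identity(a, b):
    """Percentage of aligned positions at which the two sequences are identical."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def translate(dna):
    """Translate a DNA string codon by codon using the small table above."""
    return [CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3)]


x, y = "CTTCCGCGC", "CTACCTCGA"
dna_identity = percent_identity(x, y)
protein_identity = percent_identity(translate(x), translate(y))
print(f"nucleotide identity: {dna_identity:.1f}%")
print(f"amino acid identity: {protein_identity:.1f}%")
```

The same pair of sequences scores differently depending on whether it is compared as nucleotides or as amino acids.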

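When is similarity relevant: how similar must two things be?

Redundancy in the genetic code

    1st pos |          2nd pos           | 3rd pos
            |   U      C      A      G   |
    --------+----------------------------+--------
       U    |  Phe    Ser    Tyr    Cys  |   U
       U    |  Phe    Ser    Tyr    Cys  |   C
       U    |  Leu    Ser    Stop   Stop |   A
       U    |  Leu    Ser    Stop   Trp  |   G
    --------+----------------------------+--------
       C    |  Leu    Pro    His    Arg  |   U
       C    |  Leu    Pro    His    Arg  |   C
       C    |  Leu    Pro    Gln    Arg  |   A
       C    |  Leu    Pro    Gln    Arg  |   G
    --------+----------------------------+--------
       A    |  Ile    Thr    Asn    Ser  |   U
       A    |  Ile    Thr    Asn    Ser  |   C
       A    |  Ile    Thr    Lys    Arg  |   A
       A    |  Met    Thr    Lys    Arg  |   G
    --------+----------------------------+--------
       G    |  Val    Ala    Asp    Gly  |   U
       G    |  Val    Ala    Asp    Gly  |   C
       G    |  Val    Ala    Glu    Gly  |   A
       G    |  Val    Ala    Glu    Gly  |   G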

When is similarity relevant: how similar must two things be?

Structural and chemical similarity

Aliphatic/hydrophobic amino acids: Valine, Leucine, Isoleucine
Acidic amino acids: Aspartate, Glutamate

Amino acids with similar chemical properties may be substituted in semiconserved regions without any discernible effect on function.

Similarity: when are two sequences similar? Complexity?

We have sequenced x = AAAAAA from person x and y = AAAAAA from person y. Does x = y?

Superficially, yes:

    AAAAAA
    AAAAAA

However, consider several possible scenarios:
- If x and y come from unrelated genes containing only A's, x is not the same as y!
- If x and y come from low-complexity regions of unrelated genes, regions that contain only A's, then x is also not the same as y!

Conclusion: similarity can mean different things, and the context of the comparison needs to be taken into account!

How can we account for context?

Probability of matching sequences using two models

Related model: a match is not a chance event and a match is always obtained, i.e. p(match) = 1.
    p(a,a) = p(t,t) = p(g,g) = p(c,c) = 1
    p(any other combination) = 0
Probability of the observed match under the related model:
    p(x,y | related) = p(a,a) * p(a,a) * p(a,a) * p(a,a) * p(a,a) * p(a,a) = 1

Random model (also known as the background model): a match is a chance or random event, and the likelihood of a match is the product of the base frequencies. Let's assume
    p(nucleotide n) = frequency of n: p(a) = p(t) = 0.2, p(g) = p(c) = 0.3
    so p(a,a) = 0.2 * 0.2 = 0.04
Probability of the observed match occurring by chance:
    p(x,y | random) = p(a,a) * p(a,a) * p(a,a) * p(a,a) * p(a,a) * p(a,a) = 0.04^6 = 4.1e-9

So which model do we use?

Both: the ratio of the probabilities under the two models measures how much more likely the observed match is if the sequences are related than if they match by chance.

Likelihood ratio: $L = \frac{p(x,y \mid \text{related})}{p(x,y \mid \text{random})}$

Case 1: informative background
    Related model: p(a,a) = p(t,t) = p(g,g) = p(c,c) = 1, p(any other combination) = 0
    p(x,y | related) = p(a,a)^6 = 1
    Random model: p(a) = p(t) = 0.2, p(g) = p(c) = 0.3
    p(x,y | random) = p(a,a)^6 = 0.04^6 = 4.1e-9
    L = 1 / 4.1e-9 = 2.4e8 (approximately): highly significant

Case 2: low-complexity background
    Random model: p(a) = 1 and p(t) = p(g) = p(c) = 0
    p(x,y | random) = p(a,a)^6 = 1
    L = 1 / 1 = 1: nonsignificant
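The two cases can be reproduced numerically. The sketch below assumes the simple models described on these slides (a related model that gives probability 1 to identical pairs, and a random model that multiplies base frequencies); the function names are illustrative.

```python
import math


def p_related(x, y):
    """Related model from the slides: identical pairs have probability 1, all others 0."""
    return math.prod(1.0 if a == b else 0.0 for a, b in zip(x, y))


def p_random(x, y, base_freq):
    """Random (background) model: each aligned pair has probability p(a) * p(b)."""
    return math.prod(base_freq[a] * base_freq[b] for a, b in zip(x, y))


def likelihood_ratio(x, y, base_freq):
    return p_related(x, y) / p_random(x, y, base_freq)


x = y = "AAAAAA"
mixed_background = {"A": 0.2, "T": 0.2, "G": 0.3, "C": 0.3}
poly_a_background = {"A": 1.0, "T": 0.0, "G": 0.0, "C": 0.0}

ratio_mixed = likelihood_ratio(x, y, mixed_background)
ratio_poly_a = likelihood_ratio(x, y, poly_a_background)
print(f"mixed background:  L = {ratio_mixed:.3g}")
print(f"poly-A background: L = {ratio_poly_a:.3g}")
```

The same pair of sequences gives a very large ratio against a mixed background and a ratio of 1 against a background made only of A's, which is the point about context.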

How is a related model derived?

One related model, used in the BLOSUM matrices, is derived from the frequency of all possible matches found in alignments of related sequences, e.g.

ALIGNMENT OF RELATED SEQUENCES
    AVGILKGM
    AVGILRGM
    AVGLLRGM
    GIGLLKGM
    GVGLLRGM

Here a transition means any pair of residues aligned in the same column.

$P(\text{transition } m,n) = \frac{\text{frequency of } m,n \text{ transitions}}{\text{number of all possible transitions}}$

Five sequences give 10 pairs per column, and 8 columns give 80 pairs in total:
    f(A,A) = 3, f(A,G) = 6, f(G,G) = 21, ...; f(all possible transitions) = 80
    p(A,A | related) = 3/80, p(A,G | related) = 6/80, p(G,G | related) = 21/80
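The pair counts quoted above can be reproduced by enumerating every pair of residues within each column of the alignment. This is a small sketch; the function name `pair_counts` is illustrative, and the five sequences are the example alignment read as five rows of eight residues.

```python
from collections import Counter
from itertools import combinations

ALIGNMENT = ["AVGILKGM", "AVGILRGM", "AVGLLRGM", "GIGLLKGM", "GVGLLRGM"]


def pair_counts(alignment):
    """Count every unordered residue pair found within each alignment column."""
    counts = Counter()
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


counts = pair_counts(ALIGNMENT)
total_pairs = sum(counts.values())
for pair in [("A", "A"), ("A", "G"), ("G", "G")]:
    print(pair, f"{counts[pair]}/{total_pairs}")
```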

How is a random model derived?

One random model, used in the BLOSUM matrices and usually referred to as the background model, is determined by the frequency with which each residue appears in the transition pairs derived from the same alignment.

ALIGNMENT OF RELATED SEQUENCES
    AVGILKGM
    AVGILRGM
    AVGLLRGM
    GIGLLKGM
    GVGLLRGM

$p(m) = \frac{\text{frequency of } m \text{ in a pair}}{\text{total number of pairs}}$

    f(A in a pair) = 3 (AA) + 6 (AG) = 9, f(G in a pair) = 21 (GG) + 6 (AG) = 27, ...; total number of pairs = 80
    p(A) = 9/80, p(G) = 27/80
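The background frequencies follow directly from the pair counts. The sketch below applies the counting rule exactly as stated on this slide (every pair containing the residue counts once) to the three pair counts quoted for A and G; the names are illustrative. Note that the published BLOSUM procedure weights mixed pairs by one half so that the frequencies sum to 1; that refinement is not applied here.

```python
from fractions import Fraction

# Pair counts quoted on the slides for the residues A and G only.
PAIR_COUNTS = {("A", "A"): 3, ("A", "G"): 6, ("G", "G"): 21}
TOTAL_PAIRS = 80


def background_frequency(residue):
    """Slide's rule: number of pairs containing the residue / total number of pairs."""
    in_pairs = sum(n for pair, n in PAIR_COUNTS.items() if residue in pair)
    return Fraction(in_pairs, TOTAL_PAIRS)


p_a = background_frequency("A")
p_g = background_frequency("G")
print("p(A) =", p_a, " p(G) =", p_g)
```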

Likelihood Ratio and Matrices

For reasons of speed and numerical stability (addition is faster and more computationally stable than multiplication), the log2 of the likelihood ratio for each transition (m,n) is used, rounded to the nearest integer. This works because log(a*b) = log(a) + log(b).

$\text{LogL ratio}(m,n) = \log_2 \frac{p(m,n \mid \text{related})}{p(m,n \mid \text{background})}$

For each possible transition the LogL ratio is available in a lookup table. The table is constructed from alignments of conserved sequences in the BLOCKS database, in which sequences sharing at least a given percentage identity are clustered and counted as a single sequence. Such a set of matrices is referred to as BlosumXX; Blosum62, for example, was constructed using a 62% identity threshold.

Another set is the PAM matrices, where PAM1 corresponds to an evolutionary distance of 1 accepted point mutation per 100 residues.
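As a worked example of the log-likelihood score, the sketch below takes the A,A transition from the small example alignment used earlier (p(A,A | related) = 3/80 and p(A) = 9/80) and assumes, as in the random model, that the background probability of a pair is the product of the two residue frequencies. The function name is illustrative.

```python
import math


def log_odds_score(p_pair_related, p_m, p_n):
    """Rounded log2 of p(m,n | related) / (p(m) * p(n))."""
    return round(math.log2(p_pair_related / (p_m * p_n)))


p_aa_related = 3 / 80
p_a = 9 / 80

raw_ratio = p_aa_related / (p_a * p_a)
score_aa = log_odds_score(p_aa_related, p_a, p_a)
print(f"likelihood ratio = {raw_ratio:.3f}, rounded log2 score = {score_aa}")
```

A positive score means the pair is seen more often in related sequences than expected by chance; a negative score means it is seen less often.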

Likelihood Ratio and Matrices: BlosumXX

Part of the Blosum45 matrix

         A    R    N    D    C    Q    E    G    H    I
    A    5   -2   -1   -2   -1   -1   -1    0   -2   -1
    R   -2    7    0   -1   -3    1    0   -2    0   -3
    N   -1    0    6    2   -2    0    0    0    1   -2
    D   -2   -1    2    7   -3    0    2   -1    0   -4
    C   -1   -3   -2   -3   12   -3   -3   -3   -3   -3
    Q   -1    1    0    0   -3    6    2   -2    1   -2
    E   -1    0    0    2   -3    2    6   -2    0   -3
    G    0   -2    0   -1   -3   -2   -2    7   -2   -4
    H   -2    0    1    0   -3    1    0   -2   10   -3
    I   -1   -3   -2   -4   -3   -2   -3   -4   -3    5

Part of the Blosum62 matrix

         A    R    N    D    C    Q    E    G    H    I
    A    4   -1   -2   -2    0   -1   -1    0   -2   -1
    R   -1    5    0   -2   -3    1    0   -2    0   -3
    N   -2    0    6    1   -3    0    0    0    1   -3
    D   -2   -2    1    6   -3    0    2   -1   -1   -3
    C    0   -3   -3   -3    9   -3   -4   -3   -3   -1
    Q   -1    1    0    0   -3    5    2   -2    0   -3
    E   -1    0    0    2   -4    2    5   -2    0   -3
    G    0   -2    0   -1   -3   -2   -2    6   -2   -4
    H   -2    0    1   -1   -3    0    0   -2    8   -3
    I   -1   -3   -3   -3   -1   -3   -3   -4   -3    4

Therefore the LogL ratio for a given alignment is the sum of the scores for each transition, where r is the related model and b is the background model:

$\text{LogL ratio} = \log\frac{p(x_1,y_1 \mid r)}{p(x_1,y_1 \mid b)} + \log\frac{p(x_2,y_2 \mid r)}{p(x_2,y_2 \mid b)} + \log\frac{p(x_3,y_3 \mid r)}{p(x_3,y_3 \mid b)} + \dots$

Since low-numbered BlosumXX matrices are constructed from more distantly related sequences, they are better at identifying distant relationships.
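Summing the table entries for an ungapped alignment looks like this in practice. The sketch uses only the part of the Blosum62 matrix listed above, so it can score only the ten residues A, R, N, D, C, Q, E, G, H and I; the example sequences and function name are made up for illustration.

```python
RESIDUES = "ARNDCQEGHI"
# Part of the Blosum62 matrix, rows and columns in the order of RESIDUES.
ROWS = [
    [4, -1, -2, -2, 0, -1, -1, 0, -2, -1],
    [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3],
    [-2, 0, 6, 1, -3, 0, 0, 0, 1, -3],
    [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3],
    [0, -3, -3, -3, 9, -3, -4, -3, -3, -1],
    [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3],
    [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3],
    [0, -2, 0, -1, -3, -2, -2, 6, -2, -4],
    [-2, 0, 1, -1, -3, 0, 0, -2, 8, -3],
    [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4],
]
BLOSUM62_PART = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(RESIDUES)
    for j, b in enumerate(RESIDUES)
}


def ungapped_score(x, y, matrix=BLOSUM62_PART):
    """Sum the substitution scores of each aligned pair (no gaps allowed)."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(matrix[(a, b)] for a, b in zip(x, y))


# Hypothetical short peptides: A/A + G/G + H/H + E/D
total = ungapped_score("AGHE", "AGHD")
print("alignment score:", total)
```

Here the conservative E/D substitution still contributes a positive score, which an identity matrix would not allow.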

Additional Scoring Parameters

The most common changes observed when evolutionarily related sequences are compared are multiple-base insertions or deletions, not single-base substitutions.

An insertion in one sequence might in reality be a deletion in the corresponding sequence, so such insertion/deletion events are termed indel events.

We need to decide whether to account for this type of event and, if so, to account not just for the occurrence of an indel but also for its size. To do this we use additional scoring parameters.

Gapped or ungapped: whether to allow gaps (deletions/insertions) in the alignment

    Gapped:      YR-VLDNMVI
                 YR VLD  VI
                 YRYVLD--VI

    Ungapped:    LMDTSLD
                 LM TS D
                 LMSTSFD

Additional Scoring Parameters

Gap penalty: what penalty score is associated with introducing a gap?

    VLDNTI
    VLD  I
    VLD-VI

Gap extension penalty: if a gap is extended, what score is paid?

    YR---KVLDKNMTIP
    Y+    VLD    IP
    YKYVYSVLDD--VIP

Gap Extension Penalty

Allowing gaps means that pairings of bases/amino acids may be spread throughout the alignment:

    HSTQHEAGYEARSIGVVLAWHEAE
    H----EAG--AW--G-----HE-E

Intuitively we would expect conservation to be grouped into domains and not spread throughout the alignment, since the most common rearrangements of gene sequence are insertions and deletions:

    HSTQHEAGYEARSIGVVLAWHEAE
    ----HEAG----------AWGHEE

This effect is achieved by introducing a gap extension penalty, so that a penalty is applied not only for starting a gap but also for adding to an existing gap.

Gap Extension Penalty

Scores: gap penalty = -8, EE = 5, HH = 8, GG = 6, RW = -3, AA = 4

Scattered alignment (5 gaps covering 14 positions, so 9 positions extend an existing gap):

    HSTQHEAGYEARSIGVVLAWHEAE
    H----EAG--AW--G-----HE-E

    No gap extension penalty:     -40 + 8+5+4+6+4-3+6+8+5+5 = 8
    Gap extension penalty = -2:   -40 - 18 + 8+5+4+6+4-3+6+8+5+5 = -10

Using an alternative alignment, with gap penalty = -8, gap extension penalty = -2, WW = 11, HG = -2, HE = 0, AE = -1 (2 gaps covering 14 positions, so 12 positions extend an existing gap):

    HSTQHEAGYEARSIGVVLAWHEAE
    ----HEAG----------AWGHEE

    With a gap extension penalty: -16 - (2 x 12) + 8+5+4+6+4+11-2+0-1+5 = 0

With the gap extension penalty applied, the grouped alignment now scores higher than the scattered one.
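The effect of the extension penalty can be checked by scoring both alignments programmatically. This sketch assumes the gap penalty is charged once for the first position of each gap and the extension penalty for every further position of the same gap; the pair scores are the ones listed on the slide, and the function names are illustrative.

```python
# Pair scores quoted on the slide (symmetric lookup handled in pair_score).
PAIR_SCORES = {
    ("E", "E"): 5, ("H", "H"): 8, ("G", "G"): 6, ("R", "W"): -3, ("A", "A"): 4,
    ("W", "W"): 11, ("H", "G"): -2, ("H", "E"): 0, ("A", "E"): -1,
}


def pair_score(a, b):
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]  # raises KeyError for pairs not listed on the slide


def alignment_score(top, bottom, gap_open=-8, gap_extend=0):
    """Score a finished alignment: gap_open for the first gap position,
    gap_extend for each following position of the same gap."""
    score, in_gap = 0, False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += pair_score(a, b)
            in_gap = False
    return score


TOP = "HSTQHEAGYEARSIGVVLAWHEAE"
SCATTERED = "H----EAG--AW--G-----HE-E"
GROUPED = "----HEAG----------AWGHEE"

for name, bottom in [("scattered", SCATTERED), ("grouped", GROUPED)]:
    for extend in (0, -2):
        print(name, "extension", extend, "->", alignment_score(TOP, bottom, -8, extend))
```

Under this convention the long gaps of the grouped alignment cost more once extension is penalised, but the many separate gap openings of the scattered alignment cost more still.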

How do we find the best GLOBAL alignment?

Needleman-Wunsch or dynamic global alignment algorithm

Two sequences, each of 200 bases, gaps allowed: how many possible alignments?

$\frac{2^{2n}}{\sqrt{2\pi n}} = 7.28 \times 10^{118}$ for n = 200

Example sequences: IPEEILGKI and IPEDILGEI

These are most simply represented in an m x n matrix:
- The diagonal represents the alignments without insertions or deletions in either sequence.
- The horizontal and vertical represent insertions or deletions in one of the two sequences.
- The alignment starts in the upper left row/col and extends the whole length of the diagonal.

The combinations of these movements within the matrix therefore represent all possible alignments.
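The size of the search space can be evaluated directly from the expression on this slide. A small sketch (the function name is illustrative):

```python
import math


def approx_alignment_count(n):
    """Evaluate 2^(2n) / sqrt(2 * pi * n) for two sequences of length n.

    Python integers are exact, but the division converts to float, so this
    only works while 2^(2n) still fits in a float (n up to about 500).
    """
    return 2 ** (2 * n) / math.sqrt(2 * math.pi * n)


count_200 = approx_alignment_count(200)
print(f"n = 200: about {count_200:.3g} alignments")
```

A number of this size rules out enumerating every alignment, which is why a dynamic programming approach is used instead.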

Calculating the score for any given GLOBAL alignment
The Dynamic Global alignment Algorithm

Sequences: YRKVLDKNM (rows) and IPVPILGRI (columns)

A number of choices need to be made at the start:
    Gapped or ungapped?                       Gapped
    What gap penalty do we set, if any?       8
    What gap extension penalty, if any?       0
    What matrix is to be used for scoring?    Blosum45

Step 1. Fill in the initial gap penalties

         -    I    P    V    P    I    L    G    R    I
    -    0   -8  -16  -24  -32  -40  -48  -56  -64  -72
    Y   -8
    R  -16
    K  -24
    V  -32
    L  -40
    D  -48
    K  -56
    N  -64
    M  -72

Calculating the score for any given GLOBAL alignment
The Dynamic Global alignment Algorithm 2

Step 2. Filling in each cell (row i, col j)
1. Add the score for the match (row i, col j) to the score in the upper left diagonal cell (row i-1, col j-1)
2. Or add the cost of a gap (-8) to the score in the cell to the left (row i, col j-1)
3. Or add the cost of a gap (-8) to the score in the cell above (row i-1, col j)
4. Enter the highest of these values as the score for (row i, col j), noting which cell this value came from

Example for the cell (Y, I):
    diagonal:   0 + 0 = 0
    left:      -8 - 8 = -16
    above:     -8 - 8 = -16
    The maximum is 0 and it comes from the diagonal.

         -    I    P
    -    0   -8  -16
    Y   -8    0
    R  -16

Calculating the score for any given GLOBAL alignment
The Dynamic Global alignment Algorithm 3

Step 3. Fill out the remaining scores, noting the source of each maximum

         -    I    P    V    P    I    L    G    R    I
    -    0   -8  -16  -24  -32  -40  -48  -56  -64  -72
    Y   -8    0   -8  -16  -24  -32  -40  -48  -56  -64
    R  -16   -8   -2  -10  -18  -26  -34  -42  -41  -49
    K  -24  -16   -9   -4  -11  -19  -27  -35  -39  -44
    V  -32  -21  -17   -4   -7   -8  -16  -24  -32  -36
    L  -40  -29  -24  -12   -7   -5   -3  -11  -19  -27
    D  -48  -37  -30  -20  -13  -11   -8   -4  -12  -20
    K  -56  -45  -38  -28  -21  -16  -14  -10   -1   -9
    N  -64  -53  -46  -36  -29  -23  -19  -14   -9   -3
    M  -72  -61  -54  -44  -37  -27  -21  -21  -15   -7

Step 4. Trace back the path from the bottom right-hand cell and extract the alignment from the path

    IPVPILGRI-
    Y-RKVLDKNM
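Steps 1 to 4 can be sketched in a short function. This is a minimal illustration rather than a reference implementation: it uses a fixed penalty per gap position with no separate extension penalty, and because the full Blosum45 table is not reproduced in these slides, the example call uses a simple match/mismatch score (an assumption) instead. Any function returning a score for two residues can be passed in.

```python
def needleman_wunsch(x, y, score, gap=-8):
    """Global alignment of x (rows) and y (columns) with a fixed cost per gap position."""
    rows, cols = len(x) + 1, len(y) + 1
    f = [[0] * cols for _ in range(rows)]
    # Step 1: initial gap penalties along the first row and column.
    for i in range(1, rows):
        f[i][0] = i * gap
    for j in range(1, cols):
        f[0][j] = j * gap
    # Steps 2 and 3: each cell is the best of diagonal, above and left.
    for i in range(1, rows):
        for j in range(1, cols):
            f[i][j] = max(
                f[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                f[i - 1][j] + gap,
                f[i][j - 1] + gap,
            )
    # Step 4: trace back from the bottom right-hand cell.
    ax, ay = [], []
    i, j = len(x), len(y)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and f[i][j] == f[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and f[i][j] == f[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return f[-1][-1], "".join(reversed(ax)), "".join(reversed(ay))


def simple_score(a, b):
    """Illustrative scoring only: +1 for identity, -1 otherwise (not Blosum45)."""
    return 1 if a == b else -1


best, aligned_x, aligned_y = needleman_wunsch("ACGT", "AGT", simple_score, gap=-2)
print(best)
print(aligned_x)
print(aligned_y)
```

When several paths give the same maximum, this traceback prefers the diagonal, then the cell above; other tie-breaking rules give different but equally scored alignments.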

Local vs Global alignments

The Needleman-Wunsch algorithm aligns two sequences over their entire length; however, this has distinct disadvantages when:
1. trying to identify common functional domains of distantly related sequences such as orthologues
2. comparing partial or overlapping sequences where only part of the common sequence may be present, such as when comparing ESTs

In such cases it is desirable to do local alignments, i.e. to find regions of sequence similarity amongst regions of dissimilarity. This is achieved by the Smith-Waterman algorithm.

How do we find the best LOCAL alignment?

Smith-Waterman or dynamic local alignment algorithm

Example sequences: IPEDILGEI and IPEEILGKI

Just as before, local alignments are most simply represented in an m x n matrix:
- Again, the diagonal represents the alignments without insertions or deletions in either sequence.
- The horizontal and vertical represent insertions or deletions in one of the two sequences.
- The combinations of these movements within the matrix therefore represent all possible alignments.
- The local alignments may start anywhere within the matrix and need not cover the entire diagonal.

Again: Additional Scoring Parameters

Gapped or ungapped: whether to allow gaps (deletions/insertions) in the alignment

    Gapped:      YR-VLDNMVI
                 YR VLD  VI
                 YRYVLD--VI

    Ungapped:    LMDTSLD
                 LM TS D
                 LMSTSFD

Gap penalty: what penalty score is associated with introducing a gap?

    VLDNTI
    VLD  I
    VLD-VI

Gap extension penalty: if a gap is extended, what score is paid?

    YR---KVLDKNMTIP
    Y+    VLD    IP
    YKYVYSVLDD--VIP

Calculating the score for any given LOCAL alignment
The Dynamic Local alignment Algorithm

Sequences: YRKVLDKNM (rows) and IPVPILGRI (columns)

Just as before, a number of choices need to be made at the start:
    Gapped or ungapped?                       Gapped
    What gap penalty do we set, if any?       8
    What gap extension penalty, if any?       0
    What matrix is to be used for scoring?    Blosum45

Step 1. Fill in the initial gap penalties. Although the gap penalty is 8, any negative value is replaced by 0.

         -    I    P    V    P    I    L    G    R    I
    -    0    0    0    0    0    0    0    0    0    0
    Y    0
    R    0
    K    0
    V    0
    L    0
    D    0
    K    0
    N    0
    M    0

Calculating the score for any given LOCAL alignment
The Dynamic Local alignment Algorithm 2

Step 2. Filling in each cell (row i, col j)
1. Add the score for the match (row i, col j) to the score in the upper left diagonal cell (row i-1, col j-1)
2. Or add the cost of a gap (-8) to the score in the cell to the left (row i, col j-1)
3. Or add the cost of a gap (-8) to the score in the cell above (row i-1, col j)
4. Enter the largest of these values as the score for (row i, col j), unless the maximum is < 0, in which case enter 0; note which cell the value came from

Example for the cell (Y, I):
    diagonal:   0 + 0 = 0
    left:       0 - 8 = -8
    above:      0 - 8 = -8
    The maximum is 0.

         -    I    P
    -    0    0    0
    Y    0    0
    R    0

Calculating the score for any given LOCAL alignment
The Dynamic Local alignment Algorithm 3

Step 3. Fill out the remaining scores, noting the source of each maximum

         -    I    P    V    P    I    L    G    R    I
    -    0    0    0    0    0    0    0    0    0    0
    Y    0    0    0    0    0    0    0    0    0    0
    R    0    0    0    0    0    0    0    0    7    0
    K    0    0    0    0    0    0    0    0    3    4
    V    0    3    0    5    0    3    1    0    0    6
    L    0    2    0    1    2    2    8    0    0    2
    D    0    0    1    0    0    0    0    7    0    0
    K    0    0    0    0    0    0    0    0   10    2
    N    0    0    0    0    0    0    0    0    2    8
    M    0    2    0    1    0    2    2    0    0    4

Step 4. Trace back the path from the highest scoring cell(s) and extract the alignment(s) from the path(s)

    ILGR
    VLDK
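The local version differs from the global one in only three places: the first row and column are 0, no cell may fall below 0, and the traceback starts at the highest scoring cell and stops at a 0. The sketch below illustrates this with a fixed penalty per gap position. Since the full Blosum45 table is not reproduced in these slides, the example uses an assumed match/mismatch score and made-up DNA sequences.

```python
def smith_waterman(x, y, score, gap=-8):
    """Best local alignment of x (rows) and y (columns); returns (score, sub_x, sub_y)."""
    rows, cols = len(x) + 1, len(y) + 1
    h = [[0] * cols for _ in range(rows)]  # first row and column stay at 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            h[i][j] = max(
                0,  # negative values are replaced by 0
                h[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                h[i - 1][j] + gap,
                h[i][j - 1] + gap,
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the highest scoring cell until a 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


def simple_score(a, b):
    """Illustrative scoring only: +2 for identity, -1 otherwise (not Blosum45)."""
    return 2 if a == b else -1


local_best, local_x, local_y = smith_waterman("TTACGTT", "GGACGGG", simple_score)
print(local_best, local_x, local_y)
```

Only the shared ACG core is reported; the dissimilar flanking regions are ignored rather than forced into the alignment.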

Blast and FastA tools

Two sequence comparison search tools are commonly used to compare sequences against sequence databases: Blast and FastA.

The basic algorithms used are similar; however, each uses an initial filter to approximate the results of a full Smith-Waterman search (an approach referred to as heuristics):
    Word size: the minimum number of identical matches required
    Window size: the maximum number of insertions or deletions allowed within a given spacing

There are many other pairwise sequence comparison tools, such as PSI-blast and DIALIGN, and versions of these.
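The idea of a word-based pre-filter can be illustrated with a toy example. The sketch below is not the actual Blast or FastA implementation; it only shows the general principle of locating identical words of a given size, which a real tool would then try to extend into a full alignment. The function name and sequences are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, target, word_size=3):
    """Return sorted (query_pos, target_pos) pairs where identical words start."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    seeds = []
    for j in range(len(target) - word_size + 1):
        for i in index.get(target[j:j + word_size], []):
            seeds.append((i, j))
    return sorted(seeds)


seeds = find_seeds("ACGTAC", "TTACGT", word_size=3)
print(seeds)
```

A larger word size gives fewer seeds and a faster but less sensitive search; a smaller word size does the opposite.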

Expectation / Probability

Two measures are often used to gauge the importance of a given alignment:

A P-value: the probability of obtaining the given alignment by chance. A P(n) of 0.0001 means that you would expect a similar alignment to be found once for every 10,000 sequences searched, i.e. the probability is 1 in 10,000.

An E-value: the number of similar alignments you would expect to retrieve from the database purely by chance. Sometimes a normalised E-value, E(), is given; this is essentially an expected value normalised for the size of the database and is the same number as the probability.

Expectation / Probability

The P-value is calculated from the score using an appropriate probability density function for the frequency of scores obtained from alignments of randomised sequences.

The E-value is related to the size of the database searched and is calculated as P x number of database entries:

    Blast score = 310, P = 1 x 10^-8, E = 0.008
    Blast score = 31,  P = 0.003,     E = 2400
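The relationship between the two values is a single multiplication. In the sketch below the database size of 800,000 entries is an assumption: it is not stated on the slide but is the value implied by both pairs of P and E values quoted above.

```python
import math


def e_value(p_value, database_entries):
    """Expected number of chance hits: P multiplied by the number of database entries."""
    return p_value * database_entries


DATABASE_ENTRIES = 800_000  # assumed; inferred from the quoted P and E values

e_strong = e_value(1e-8, DATABASE_ENTRIES)
e_weak = e_value(0.003, DATABASE_ENTRIES)
print(f"score 310: E = {e_strong:g}")
print(f"score 31:  E = {e_weak:g}")
```

An E-value well below 1 means a chance hit of that quality is unlikely in a database of this size; an E-value in the thousands means such hits are expected routinely.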

Multiple Sequence Alignment

Evolution: gene divergence and duplication

Over an evolutionary timescale organisms diverge, and as the evolutionary distance increases the sequence of any given protein drifts, subject to evolutionary pressure for either change or conservation of given residues.

The conservation of sequence is critical only in certain regions, such as functional domains, domains required for the structural integrity of the protein, or domains for the binding of ligands.

Sometimes as divergence occurs a gene may be duplicated. Whilst the maintenance of one copy is functionally necessary, the second is not; this duplicated gene may therefore diverge where it is evolutionarily advantageous and eventually acquire a different but related function.

The similarity between such homologous genes is often sufficient to detect the relationship by sequence comparison. Such similarities are detected not by pairwise comparison of sequences but by multiple sequence alignment, using tools such as CLUSTALW.

How do we identify sequence conservation within families?

Conserved regions within a family of proteins are identified from multiple sequence comparisons of putative family members.

By adding more and more sequences to an alignment it is possible to identify which of the conserved residues are critical to function.
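Once a multiple alignment exists, finding fully conserved positions is a column-by-column check. The sketch below reuses the small five-sequence example alignment from the BLOSUM slides; the function name is illustrative, and real tools also report partially conserved columns, which this sketch does not.

```python
ALIGNMENT = ["AVGILKGM", "AVGILRGM", "AVGLLRGM", "GIGLLKGM", "GVGLLRGM"]


def conserved_columns(alignment):
    """Return the 0-based indices of columns in which every sequence has the same residue."""
    return [i for i, column in enumerate(zip(*alignment)) if len(set(column)) == 1]


conserved = conserved_columns(ALIGNMENT)
print("conserved columns:", conserved)
print("conserved residues:", "".join(ALIGNMENT[0][i] for i in conserved))
```

Adding a sequence can only keep or shrink this list, which is why more sequences narrow down the residues that are truly invariant.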

Evolutionary relationship between sequences results in conservation within families

A useful way of displaying how related sequences are is a cladogram or phylogram. This is basically a branch system connecting related data.

In the phylogram, the length of the branch indicates relatedness within a phylum, i.e. it reflects the evolutionary change.

A cladogram shows the relationship within the clade, reflecting the common ancestry but not as a measure of evolutionary time.

In either case, the shorter the distance to a common branch, the more closely related the sequences.

Summary Points

1. Comparison of two sequences is most simply carried out by bringing the two sequences into register such that the maximum number of identical pairs may be counted, effectively scoring by identity.
   Why are identity matrices not used for alignments of proteins?
2. The presence of indels in one or both sequences makes it necessary to introduce gaps; however, in order to obtain contiguous pairing in the alignment it is necessary to incorporate penalty scores for introducing and extending gaps, forcing matched pairs to be grouped together.
3. More distant evolutionary relationships require a scoring system which accounts for similarity and not just identity; such scoring matrices include BLOSUM and PAM.
   Why is conservation likely to be observed in the functional domains of related proteins such as orthologues, and conversely why might we expect to see more differences in regions that don't contribute to function but do play a role in the protein scaffold?
   What is the basis of the BLOSUM and PAM matrices?

Summary Points

4. Comparisons can illustrate different properties of the relationship between two sequences. The relationship may extend across the entire sequences, requiring global comparison using the Needleman-Wunsch algorithm, or it may be limited to localised regions such as functional domains, requiring analysis using the Smith-Waterman algorithm.
   Global alignment might be good at highlighting small differences between two sequences, or at identifying introns when comparing genomic and mRNA sequences, but why might it be poor at aligning two distantly related or partially overlapping sequences?
5. Comparing a sequence with the contents of a database is effectively multiple pairwise comparisons; however, in order to make the search sufficiently rapid whilst remaining sensitive, the search tools (BLAST and FASTA) pre-filter the possible alignments, seeding the start of the alignments at regions with a given amount of identity.
6. An alignment of two sequences is not proof of a relationship, and statistical values are used to indicate the level of confidence attached to the alignment. These are given as log likelihood scores, probability (P) values and expected frequency (E) values.
   What are P- and E-values and how are they calculated?

Summary Points

7. The sequence of any given gene drifts, subject to evolutionary pressure for either change or conservation of given residues. The conservation of sequence is critical only in certain regions.
   What critical regions of a protein might be conserved and why?
8. ClustalW is a program used to produce multiple sequence alignments, and JalView is a specialised program for viewing the output from such alignments.
   To what end might you use multiple sequence alignment?
9. There are several ways to visualise the results of an alignment. One very useful way of summarising the information is a branch system connecting related sequences, such as a cladogram or phylogram.
10. In the phylogram, the length of the branch indicates relatedness within a phylum, i.e. it reflects the evolutionary change between sequences.
11. A cladogram shows the relationship within the clade, reflecting the common ancestry. In either case, the shorter the branch, the more closely related the sequences.